QED
Queen’s Economics Department Working Paper No. 1201

A Practitioner’s Guide to Bayesian Estimation of Discrete Choice Dynamic Programming Models

Susumu Imai, Queen’s University
Andrew Ching, University of Toronto
Masakazu Ishihara, University of Toronto
Neelam Jain, Northern Illinois University

Department of Economics, Queen’s University
94 University Avenue, Kingston, Ontario, Canada K7L 3N6

4-2009
A Practitioner’s Guide to Bayesian Estimation of Discrete Choice Dynamic Programming Models*

Andrew Ching, Rotman School of Management, University of Toronto
Susumu Imai, Department of Economics, Queen’s University
Masakazu Ishihara, Rotman School of Management, University of Toronto
Neelam Jain, Department of Economics, Northern Illinois University

This draft: April 5, 2009

*This is work-in-progress. Comments are welcome.
Abstract

This paper provides a step-by-step guide to estimating discrete choice dynamic programming (DDP) models using the Bayesian Dynamic Programming algorithm developed in Imai, Jain and Ching (2008) (IJC). The IJC method combines the DDP solution algorithm with the Bayesian Markov Chain Monte Carlo algorithm into a single algorithm, which solves the DDP model and estimates its structural parameters simultaneously. The main computational advantage of this estimation algorithm is its efficient use of information obtained in past iterations. In the conventional Nested Fixed Point algorithm, most of the information obtained in past iterations remains unused in the current iteration. In contrast, the Bayesian Dynamic Programming algorithm extensively uses the computational results from past iterations to help solve the DDP model at the current parameter values. Consequently, it significantly alleviates the computational burden of estimating a DDP model. We carefully discuss how to implement the algorithm in practice, and use a simple dynamic store choice model to illustrate how to apply it to obtain parameter estimates.
1 Introduction
In economics and marketing, there is a growing empirical literature which studies the choices
of agents on both the demand and supply side, taking into account their forward-looking
behavior. A common framework for capturing the forward-looking behavior of consumers or
firms is the discrete choice dynamic programming (DDP) framework. This framework has been
applied to study managers' decisions to replace old equipment (e.g., Rust 1987),
career choice (e.g., Keane and Wolpin 1997; Diermeier, Merlo and Keane 2005),
the choice to commit crimes (Imai and Krishna 2004), dynamic brand choice (e.g., Erdem
and Keane 1996; Gonul and Srinivasan 1996), dynamic quantity choice with stockpiling
behavior (e.g., Erdem, Imai and Keane 2003; Sun 2005; Hendel and Nevo 2006), new
product/technology adoption decisions (Ackerberg 2003; Song and Chintagunta 2003;
Crawford and Shum 2005; Yang and Ching 2008), new product introduction decisions
(Hitsch 2006), etc. Although the framework provides a theoretically tractable way to
model forward-looking incentives, and this literature has been growing, it remains small
relative to the literature that models choice using a static reduced-form framework. This
is mainly due to two obstacles to estimating this class of models: (i) the curse of
dimensionality in the state space, which constrains the development of models that match
real-world applications; and (ii) the complexity of the likelihood/GMM objective function,
which makes it difficult to search for the global maximum/minimum when using a classical
approach to estimate them. Several studies have proposed different ways to approximate the
dynamic programming solutions and reduce the hurdle posed by the curse of dimensionality
(e.g., Keane and Wolpin 1994; Rust 1997; Hotz and Miller 1993; Aguirregabiria
and Mira 2002; Ackerberg 2001).[1] Nevertheless, little progress has been made in handling
the complexity of the likelihood function of DDP models. A typical approach is
to use different initial values to re-estimate the model, and check which set of parameter
estimates gives the highest likelihood value. However, without knowing the exact shape
of the likelihood function, it is often difficult to confirm whether the estimated parameter
vector indeed gives us the global maximum.

[1] Geweke and Keane (2002) proposed using a flexible polynomial to approximate the future component of the Bellman equation. Their approach allows them to conduct Bayesian inference on the structural parameters of the current payoff functions and the reduced-form parameters of the polynomial approximation. However, since it completely avoids solving for the DDP model and fully specifying it, their estimation results are not efficient and policy experiments cannot be conducted under their approach.

In the past two decades, the Bayesian Markov Chain Monte Carlo (MCMC) approach
has provided a tractable way to simulate the posterior distribution of parameter vectors
for complicated static discrete choice models, making the posterior mean an attractive
estimator compared with classical point estimates in that setting (Albert and Chib 1993;
McCulloch and Rossi 1994; Allenby and Lenk 1994; Allenby 1994; Rossi et al. 1996;
Allenby and Rossi 1999). However, researchers seldom use the Bayesian approach to
estimate DDP models. The main problem is that the Bayesian MCMC approach typically
requires many more iterations than the classical approach to achieve convergence. For each
simulated draw of the parameter vector, the DDP model needs to be solved to calculate
the likelihood function. As a result, the computational burden of solving a DDP model
has essentially ruled out the Bayesian approach except for very simple models, where the
model can be solved very quickly or a closed-form solution exists (e.g., Lancaster 1997).

Recently, Imai, Jain and Ching (2008) (IJC) propose a new modified MCMC algorithm
to reduce the computational burden of estimating infinite horizon DDP models using
the Bayesian approach. This method combines the DDP solution algorithm with the
Bayesian MCMC algorithm into a single algorithm, which solves the DDP model and
estimates its structural parameters simultaneously. In the conventional Nested Fixed
Point algorithm, most of the information obtained in the past iterations remains unused
in the current iteration. In contrast, the IJC algorithm extensively uses the computational
results obtained in past iterations to help solve the DDP model at the current
parameter values. This new method is potentially superior to prior methods
because (1) it significantly reduces the computational burden of solving for the DDP model
in each iteration, and (2) it produces the posterior distribution of parameter vectors, and
the corresponding solutions for the DDP model – this avoids the need to search for the
global maximum of a complicated likelihood function.
This paper provides a step-by-step guide to estimating discrete choice dynamic programming
(DDP) models using the IJC method. We carefully discuss how to implement
the algorithm in practice, and use a simple dynamic store choice model to illustrate how
to apply this algorithm to obtain parameter estimates. Our goal is to reduce the costs of
adopting this new method and expand the toolbox of researchers who are interested in
estimating DDP models. The rest of the paper is organized as follows. In section 2, we
present a dynamic store choice model, where each store offers its own reward program.
In section 3, we present the IJC method and explain how to implement it to obtain
parameter estimates for this model. We also discuss the practical aspects of using the IJC
method. In section 4, we use the estimation results of the dynamic store choice model to
demonstrate certain properties of the IJC estimation method. Section 5 discusses how
to extend the IJC algorithm to (i) conduct policy experiments, and (ii) allow for continuous
state variables. We also comment on the choice of kernels. Section 6 concludes.
2 The Model

2.1 The Basic Framework
Suppose that there are two supermarkets in a city (j = 1, 2). Each store offers a stamp
card, which can be exchanged for a gift upon completion. Consumers get one stamp for
each visit with a purchase.
Reward programs at the two supermarkets differ in terms of (i) the number of stamps
required for a gift (Sj), and (ii) the mean value of the gift (Gj). Consumers get a gift in
the same period (t) that they complete the stamp card. Once consumers receive a gift,
they will start with a blank stamp card again in the next period.
Let p_ijt be the price that consumer i pays at supermarket j at time t. We assume
that prices at store j in each period are drawn from an iid normal distribution, N(p, σ_p^2),
which is common across stores. We also assume that this price distribution is known to
consumers. Let s_ijt ∈ {0, 1, . . . , S_j − 1} denote the number of stamps collected for store
j in period t before consumers make a decision. Note that sijt does not take the value
Sj because of our assumption that consumers get a gift in the same period that they
complete the stamp card.
Consumer i’s single period utility of visiting supermarket j in period t at s_it =
(s_i1t, s_i2t) is given by

    U_ijt(s_it) = { α_j + γ p_ijt + G_ij + ε_ijt    if s_ijt = S_j − 1,
                  { α_j + γ p_ijt + ε_ijt           otherwise,
where αj is the consumer loyalty (or brand equity) for store j, γ is the price sensitivity,
Gij is consumer i’s valuation of gift for store j, and εijt is the idiosyncratic error term.
We assume ε_ijt is extreme-value distributed. G_ij is assumed to be normally distributed
around G_j with standard deviation σ_Gj. In each period, consumers may choose not
to go shopping. The single period mean utility of no shopping is normalized to zero, i.e.,
Ui0t = εi0t.
Consumer i’s objective is to choose a sequence of store choices to maximize the sum
of the present discounted future utility:
    max_{{d_ijt}_{t=1}^∞}  E[ Σ_{t=1}^∞ β^{t−1} d_ijt U_ijt(s_it) ],
where dijt = 1 if consumer i chooses store j in period t and dijt = 0 otherwise. β is the
discount factor. The evolution of state, sit, is deterministic and depends on consumers’
choice. Given the state sijt, the next period state, sijt+1, is determined as follows:
    s_ijt+1 = { s_ijt + 1   if s_ijt < S_j − 1 and the consumer purchases at store j in period t,
              { 0           if s_ijt = S_j − 1 and the consumer purchases at store j in period t,    (1)
              { s_ijt       if the consumer purchases at store −j or does not shop in period t.
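The transition rule (1) is simple to code. A minimal sketch in Python, for one store's stamp count (the function and argument names are our own, purely illustrative):

```python
def next_stamps(s_j, S_j, bought_at_j):
    """State transition (1) for one store's stamp count s_j in {0, ..., S_j - 1}."""
    if not bought_at_j:
        return s_j          # purchase at the other store or no shopping: unchanged
    if s_j < S_j - 1:
        return s_j + 1      # one more stamp for a purchase
    return 0                # card completed this period: start with a blank card
```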
Let θ be the vector of parameters. Also, define s_i = (s_i1, s_i2), p_i = (p_i1, p_i2), and
G_i = (G_i1, G_i2). In state s_i, the Bellman equation for consumer i is given by

    V(s_i, p_i; G_i, θ) ≡ E_ε max{ V_0(s_i; θ) + ε_i0, V_1(s_i, p_i1; G_i, θ) + ε_i1, V_2(s_i, p_i2; G_i, θ) + ε_i2 }
                        = log( exp(V_0(s_i; θ)) + exp(V_1(s_i, p_i1; G_i, θ)) + exp(V_2(s_i, p_i2; G_i, θ)) ),    (2)
where the second equality follows from the extreme value assumption on ε. The alternative-
specific value functions are written as: For j = 1, 2,
    V_j(s_i, p_ij; G_i, θ) = { α_j + γ p_ij + G_ij + β E_p′[V(s′, p′; G_i, θ)]   if s_ij = S_j − 1,
                             { α_j + γ p_ij + β E_p′[V(s′, p′; G_i, θ)]          otherwise,          (3)

    V_0(s_i; θ) = β E_p′[V(s′, p′; G_i, θ)],
where the state transition from si to s′ follows (1), and the expectation with respect to
p′ is defined as
    E_p′[V(s′, p′; G_i, θ)] = ∫ V(s′, p′; G_i, θ) dF(p′).
The parameters of the model are α_j (store loyalty), G_j (consumers' mean value of the
gift offered by store j), σ_Gj (standard deviation of G_ij), γ (price sensitivity), β (discount
factor), p (mean price common across stores), and σ_p (standard deviation of prices).
Hartmann and Viard (2008) estimated a dynamic model with reward programs that is
similar to the one here. The main differences are (1) we allow for two stores with different
reward programs in terms of (Gj, Sj) while they considered one store (golf club); (2) we
estimate the discount factor (i.e., β) while they fixed it according to the interest rate.
The general dynamics of this model are also more complicated than those of the model used in IJC for
Monte Carlo exercises. The model here has two endogenous state variables (s1, s2), while
the dynamic firm entry-exit decision model used in IJC has one exogenous state variable
(capital stock). However, IJC consider a normal error term, which is more general than
the extreme value error term we assume here. We consider the extreme value error term
because (1) it is quite common for researchers to adopt this distributional assumption when
estimating a DDP model, and (2) our analysis here complements that of IJC.
2.2 Identification
The main dynamics of the model is the intertemporal trade-off created by the reward
program. Suppose that a consumer is closer to the completion of the stamp card for
store 1, but the price is lower in store 2 today. If the consumer chooses store 2 based on
the lower price, he or she will delay the completion of the stamp card for store 1. If the
consumer takes the future into account, the delay will lower the present discounted value of
the reward. Thus he/she will have an incentive to keep shopping at store 1 even though
the price at store 2 is lower. Moreover, such an incentive should depend on the discount
factor.
This dynamic trade-off suggests that the empirical choice frequency of visiting the
stores across states should allow us to pin down the discount factor. To illustrate this
point, we consider a model of homogeneous consumers with only one store and an outside
option and simulate choice probabilities for different discount factors. In this exercise,
we set α1 = −2, γ = 0, G1 = 3, and S1 = 5. Figure 1 shows how the choice probability
of visiting the store changes across states (no. of stamps collected) for different discount
factors (β = 0, 0.5, 0.75, 0.9, 0.999). When β = 0, the choice probability is purely
determined by the single period utility. Thus, the choice probability is flat from s = 0 to
s = 3. At s = 4, consumers receive the gift, so the choice probability jumps up. Another
extreme case is when β is close to one (β = 0.999). Interestingly, the choice probabilities
are essentially constant across states, and higher than those of β = 0 for s = 0, ..., 3. But
at s = 4, the choice probability for β = 0.999 is smaller than that for β = 0.
Figure 2 shows how the choice probabilities at each state change with β. In general, as
we increase β, we observe three predictions: (i) when s = 4, the store choice probability
monotonically decreases; (ii) when s < 3, the store choice probability monotonically
increases; (iii) when s = 3, the store choice probability first increases and then decreases.
Why do we have prediction (i)? First, we note that EpV (s) always increases with β for
all s, but the extent of the increase would differ across s. In particular, EV (s = 4) is
much more responsive than EV (s = 0) to an increase in β. This is because a consumer
is much closer to getting a reward at s = 4. This shows that when β increases, the value
of choosing the outside option (depending on EV (s = 4)) will increase faster than that
of choosing to visit the store (depending on EV (s = 0)). As a result, the store choice
probability decreases with β. Prediction (ii) is mainly due to the fact that the incentive to
earn a reward increases with β. To understand prediction (iii), we note that in general an
increase in β has two effects on store choice. On the one hand, it increases the incentive
to earn a reward as the consumer puts more weight in the future. On the other hand, a
larger β also implies that the consumer is more patient and, consequently, less
concerned about waiting (i.e., the costs of waiting become smaller). The interaction between
these two forces explains prediction (iii). When β first increases from zero, the former
force (i.e., the incentive to earn a reward earlier) dominates. But as β gets closer to
1, the latter force (i.e., the patience factor) catches up. Lastly, it should be noted that the
choice probabilities become flat when β is very close to one. This is because the difference
between these two forces essentially becomes constant across s, when β approaches one.
The above discussion also suggests that the identification of α_j and G_j crucially depends
on the value of β. When β is close to 1, it is very hard to identify them separately
because they play very similar roles in determining the choice probabilities. But when β
moves away from 1, it becomes much easier to identify them.[2]

[2] Hartmann and Viard (2008) also discussed how the discount factor would affect the pattern of choice probabilities. However, because they take the intrinsic discount factor as exogenously given (determined by the interest rate), they argue that such an effect would operate through the “artificial” discount factor, which depends on how frequently a customer visits a store (determined by α_j here).
2.3 Numerical Solution of the Model
We first consider how to solve the model without consumer heterogeneity in G_ij; solving
the model with consumer heterogeneity is a straightforward extension. To solve the
model without consumer heterogeneity (i.e., G_ij = G_j for all i and j = 1, 2) numerically,
we:

1. Make M draws {p_j^m}_{m=1}^M from the price distribution N(p, σ_p^2) for j = 1, 2. These
   draws are held fixed throughout.

2. Start with an initial guess of the Emax functions, e.g., setting V^0(s, p; θ) = 0, ∀s, p.
   Suppose that we know V^l, where l is the number of past iterations. We discuss
   how to obtain V^{l+1}.

3. Substitute {p^m}_{m=1}^M into V^l(s, p; θ), and take the average across the p^m's to obtain
   a Monte Carlo approximation of E_p′ V^l(s′, p′; θ), ∀s′.

4. Substitute these approximated expected future value functions into the Bellman
   operator to obtain V^{l+1}(s, p; θ), ∀s, p.

5. Repeat steps 3-4 until E_p V^{l+1}(·, p; θ) converges.

For the model with consumer heterogeneity in G_ij, we would need to compute V^{l+1}(s, p; G_i, θ)
for each i.
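As a concrete illustration, the five steps above can be sketched in Python for the homogeneous case. The parameter values, array layout, and function name below are our own illustrative choices, not the paper's estimates:

```python
import numpy as np

# Illustrative parameter values (our own choices, not estimates from the paper)
alpha = [-2.0, -2.0]   # store loyalties alpha_1, alpha_2
gamma = -1.0           # price sensitivity
G     = [3.0, 3.0]     # gift values G_1, G_2 (homogeneous consumers)
S     = [5, 5]         # stamps required S_1, S_2
beta  = 0.9            # discount factor
p_mean, sigma_p, M = 1.0, 0.2, 100

rng = np.random.default_rng(0)
prices = rng.normal(p_mean, sigma_p, size=(M, 2))  # step 1: fixed price draws

def solve_emax(tol=1e-8):
    """Value iteration on E_p V(s, p; theta) over the stamp-count grid."""
    EV = np.zeros((S[0], S[1]))                    # step 2: initial guess
    while True:
        new_EV = np.empty_like(EV)
        for s1 in range(S[0]):
            for s2 in range(S[1]):
                n1, n2 = (s1 + 1) % S[0], (s2 + 1) % S[1]   # transition (1)
                v0 = beta * EV[s1, s2]                      # no shopping
                v1 = (alpha[0] + gamma * prices[:, 0]
                      + (G[0] if s1 == S[0] - 1 else 0.0)
                      + beta * EV[n1, s2])
                v2 = (alpha[1] + gamma * prices[:, 1]
                      + (G[1] if s2 == S[1] - 1 else 0.0)
                      + beta * EV[s1, n2])
                # logsum over alternatives (extreme value errors), then
                # average over the price draws (steps 3-4)
                logsum = np.log(np.exp(v0) + np.exp(v1) + np.exp(v2))
                new_EV[s1, s2] = logsum.mean()
        if np.max(np.abs(new_EV - EV)) < tol:      # step 5: convergence
            return new_EV
        EV = new_EV
```

Since β < 1 the Bellman operator is a contraction, so the loop converges for any initial guess.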
3 Estimation Method

3.1 The IJC algorithm
It is well known that when using maximum likelihood or Bayesian MCMC to estimate
discrete choice dynamic programming models, the main computational burden is that
the value functions need to be solved for each trial parameter vector (for maximum
likelihood) or each random draw of the parameter vector (for Bayesian MCMC). Since
both procedures, in particular Bayesian MCMC, require many iterations to achieve
convergence, a typical nested fixed point algorithm needs to repeatedly apply the
iteration procedure outlined above to solve for the value functions.[3] As a result, the
computational burden is so large that even a very simple discrete choice dynamic
programming model cannot be estimated using standard Bayesian MCMC methods.

[3] It should be noted that the Bayesian MCMC algorithm generally needs to be run 10,000 to 30,000 iterations to obtain enough draws for the posterior distributions.

The IJC algorithm relies on two insights to reduce the computational burden of each
MCMC iteration. (1) It could be quite wasteful to compute the value function exactly
before the Markov chain converges to the true posterior distribution. Therefore, the IJC
algorithm proposes to “partially” solve the Bellman equation for each parameter draw
(at the minimum, applying the Bellman operator only once in each iteration). (2) The value
functions evaluated at past MCMC draws of the parameters contain useful information
about the value functions at the current draw, in particular those
evaluated within a neighborhood of the current parameter values. However, the traditional
nested fixed point algorithm hardly makes use of them. Therefore, IJC propose to replace
the contraction mapping procedure for solving the value functions with a weighted average
of the pseudo-value functions obtained as past outcomes of the estimation algorithm. In
IJC, the weight depends on the distance between the past parameter vector draw and
the current one – the shorter the distance, the higher the weight. The basic intuition is
that the value function is continuous in the parameter space. Therefore, it is possible
to use the past value functions to form a non-parametric estimate of the value function
evaluated at the current draw of parameter values. Such a non-parametric estimate is
usually computationally much cheaper than the standard contraction mapping procedure,
in particular for β close to 1. Combining these two insights, IJC dramatically reduce the
computational burden of each iteration in the Bayesian MCMC algorithm. This modified
procedure differs from the standard nested fixed point algorithm in an important aspect:
instead of solving the model and searching for parameters alternately, it solves and estimates
the model simultaneously.
In the context of the reward program example without consumer heterogeneity, the
outputs of the algorithm in each iteration r include {θ^r, E_p V^r(s, p; θ^r)}, where V^r is the
pseudo-value function. To obtain these outputs, IJC make use of the past outcomes of
the estimation algorithm, H^r = {θ^l, E_p V^l(s, p; θ^l)}_{l=r−N}^{r−1}. We now extend the Bellman
equations (2)-(3) to illustrate the pseudo-value functions defined in IJC.
The pseudo-value functions are defined as follows. To simplify notation, we drop the
i subscript on s and p. For each s,

    V^r(s, p; θ^r) = log( exp(V_0^r(s; θ^r)) + exp(V_1^r(s, p_1; θ^r)) + exp(V_2^r(s, p_2; θ^r)) ),    (4)

where, for j = 1, 2,

    V_j^r(s, p_j; θ^r) = { α_j + γ p_j + G_j + β E_p′^r V(s′, p′; θ^r)   if s_j = S_j − 1,
                         { α_j + γ p_j + β E_p′^r V(s′, p′; θ^r)          otherwise,                    (5)

    V_0^r(s; θ^r) = β E_p′^r V(s′, p′; θ^r).
The approximated Emax functions are defined as a weighted average of the past
pseudo-value functions obtained from the estimation algorithm. For instance, E_p^r V(s, p; θ^r)
can be constructed as follows:

    E_p^r V(s, p; θ^r) = Σ_{l=r−N}^{r−1} E_p V^l(s, p; θ^l) · K_h(θ^r − θ^l) / Σ_{k=r−N}^{r−1} K_h(θ^r − θ^k),    (6)

where K_h(·) is a Gaussian kernel with bandwidth h. To obtain E_p V^r(s, p; θ^r), we simulate
a set of prices, {p^m}_{m=1}^M, from the price distribution N(p, σ_p^2), evaluate V^r(s, p^m; θ^r)
using (4) and (5), and average across the draws:

    E_p V^r(s, p; θ^r) = (1/M) Σ_{m=1}^M V^r(s, p^m; θ^r).    (7)
IJC show that, by treating the pseudo-value function as the true value function in an
MCMC algorithm and applying it to estimate a dynamic discrete choice problem with a
discrete state space, the modified MCMC draws of θ^r converge to the true posterior
distribution uniformly.
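The kernel-weighted average in (6) is straightforward to code. A minimal sketch with a Gaussian product kernel (the array shapes and function name are our own assumptions):

```python
import numpy as np

def emax_approx(theta_r, past_thetas, past_emax, h=0.01):
    """Kernel-weighted average (6) of stored expected pseudo-value functions.

    past_thetas: (N, K) array of past parameter draws theta^l
    past_emax:   (N, n_states) array of stored E_p V^l(s, p; theta^l)
    Returns an (n_states,) approximation of E_p^r V(s, p; theta_r).
    """
    diff = (np.asarray(past_thetas) - theta_r) / h
    # Gaussian product kernel K_h(theta_r - theta^l); normalizing constants
    # cancel in the weight ratio
    log_k = -0.5 * np.sum(diff * diff, axis=1)
    w = np.exp(log_k - log_k.max())      # stabilize before normalizing
    w /= w.sum()
    return w @ np.asarray(past_emax)     # weighted average over past draws
```

Subtracting the maximum before exponentiating keeps the weights numerically stable when the kernel distances are large.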
It should be highlighted that the approximated Emax function defined in (6) is the key
innovation of IJC. In principle, this step is also applicable in classical estimation methods
such as GMM and maximum likelihood.[4] However, there are several advantages to
implementing IJC's pseudo-value function approach in Bayesian estimation. First, the
non-parametric approximation in (6) is more efficient if the past pseudo-value functions
are evaluated at θ^l's that are randomly distributed around θ^r. This is naturally
achieved by the Bayesian MCMC algorithm. In contrast, classical estimation
methods typically require minimizing/maximizing an objective function. Commonly used
minimization/maximization routines (e.g., BHHH, quasi-Newton methods) tend to
search over the parameter space along a particular path. Consequently, we believe that the
approximation step proposed by IJC should perform better under the Bayesian MCMC
approach.[5] Second, in the presence of unobserved consumer heterogeneity, it is common
for the likelihood function to be multi-modal even in static choice problems. In this
situation, Bayesian posterior means are often better estimators of the true parameter
values than classical point estimates, because in practice, accurately simulating
a posterior is in many cases much easier than finding the global maximum/minimum
of a complex likelihood/GMM objective function.

[4] Brown and Flinn (2006) extend this key step to estimating a dynamic model of marital status choice and investment in children using the method of simulated moments.
We now turn to how to implement the IJC method to estimate the dynamic store
choice model presented here. We consider two versions of the model: (i) without
unobserved consumer heterogeneity, and (ii) with unobserved consumer heterogeneity.
3.2 Implementation of the IJC algorithm
In this subsection, we discuss how to estimate the dynamic store choice model using the
IJC algorithm. The steps are similar to the standard Metropolis-Hastings algorithm,
except that we use (6) to calculate the Emax functions. Let I_buy,ijt be an indicator for
purchasing at store j by consumer i at time t, and let p_it = (p_i1t, p_i2t) be the price vector
facing consumer i at time t. We use (I^d_buy,ijt, p^d_ijt) to denote the observed data.

[5] A stochastic optimization algorithm, simulated annealing, has recently gained some attention for handling complicated objective functions. This algorithm is an adaptation of the Metropolis-Hastings algorithm (Kirkpatrick et al. 1983; Cerny 1985). The IJC method should also be well suited to being combined with simulated annealing in the classical estimation approach. However, before starting the estimation, this method requires the researcher to choose a “cooling” rate, and the ideal cooling rate cannot be determined a priori. In the MCMC-based Bayesian algorithm, one does not need to deal with this nuisance parameter.
Our focus is to provide a step-by-step implementation guide and discuss the practical
aspects of the IJC algorithm. Once readers understand the implementation of the
IJC algorithm in this simple example, they should be able to extend it to other more
complicated settings.
3.2.1 Homogeneous Consumers
We first present the implementation of the IJC algorithm when consumers are homogeneous
in their valuations of G_j (i.e., σ_Gj = 0 for j = 1, 2). The vector of parameters to
be estimated is θ = (α_1, α_2, G_1, G_2, γ, β).

The IJC algorithm generates a sequence of MCMC draws θ^r, r = 1, 2, .... The
algorithm modifies the Metropolis-Hastings algorithm, and involves drawing a candidate
parameter vector θ^{*r} in each iteration. For each iteration (r):
1. Start with a history of past outcomes of the algorithm,

       H^r = { {θ^{*l}, E_p V^l(·, p; θ^{*l})}_{l=r−N}^{r−1}, ρ^r(θ^{r−1}) },

   where N is the number of past iterations used for the Emax approximation; E_p V^l is
   the expected pseudo-value function with respect to p; and ρ^r(θ^{r−1}) is the
   pseudo-likelihood of the data conditional on E_p^r V(·, p; θ^{r−1}).

2. Draw a candidate parameter vector θ^{*r} from a proposal distribution q(θ^{r−1}, θ^{*r}).

3. Compute the pseudo-likelihood conditional on θ^{*r} based on the approximated Emax
   functions. Then determine whether or not to accept θ^{*r} based on the acceptance
   probability,

       min( [π(θ^{*r}) · ρ^r(θ^{*r}) · q(θ^{*r}, θ^{r−1})] / [π(θ^{r−1}) · ρ^r(θ^{r−1}) · q(θ^{r−1}, θ^{*r})], 1 ).

   Essentially, we apply the Metropolis-Hastings algorithm here, treating the
   pseudo-likelihood as the true likelihood. If we accept, set θ^r = θ^{*r}; otherwise, set
   θ^r = θ^{r−1}. To obtain ρ^r(θ^{*r}), we need to calculate E_p^r V(·, p; θ^{*r}), which is
   obtained as the weighted average of the past expected pseudo-value functions,
   {E_p V^l(·, p; θ^{*l})}_{l=r−N}^{r−1}, with the weight of each past expected pseudo-value
   function determined by Gaussian kernels:

       E_p^r V(s, p; θ^{*r}) = Σ_{l=r−N}^{r−1} E_p V^l(s, p; θ^{*l}) · K_h(θ^{*r} − θ^{*l}) / Σ_{k=r−N}^{r−1} K_h(θ^{*r} − θ^{*k}).

4. Compute the pseudo-value function, E_p V^r(s, p; θ^{*r}):

   (a) Make M draws of prices, {p^m}_{m=1}^M, from the price distribution.

   (b) Compute V_0^r(s; θ^{*r}), V_1^r(s, p_1^m; θ^{*r}) and V_2^r(s, p_2^m; θ^{*r}), using the
       approximated Emax functions computed in step 3.

   (c) Given V_0^r(s; θ^{*r}), V_1^r(s, p_1^m; θ^{*r}) and V_2^r(s, p_2^m; θ^{*r}), obtain the
       pseudo-value function V^r(s, p^m; θ^{*r}) (see (4)). By averaging V^r(s, p^m; θ^{*r})
       across the p^m's, we integrate out prices and obtain E_p V^r(s, p; θ^{*r}).

5. Compute the pseudo-likelihood, ρ^{r+1}(θ^r), based on the updated set of past
   pseudo-value functions.[6]

6. Go to iteration r + 1.
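The loop above can be sketched as a compact Python skeleton. The model-specific pieces are passed in as callables, and everything here (names, signatures, the toy random-walk proposal) is our own illustration rather than IJC's reference implementation:

```python
import numpy as np

def kernel_weights(theta, past_thetas, h):
    """Gaussian kernel weights K_h(theta - theta^l), normalized to sum to 1."""
    d = (np.asarray(past_thetas) - theta) / h
    logk = -0.5 * np.sum(d * d, axis=1)
    w = np.exp(logk - logk.max())
    return w / w.sum()

def ijc_mcmc(theta0, n_iter, N, h, step, log_prior, pseudo_loglik,
             bellman_once, n_states, seed=0):
    """Skeleton of the IJC loop (steps 1-6), homogeneous-consumer case.

    pseudo_loglik(theta, emax) returns the pseudo log-likelihood given an
    approximated Emax vector; bellman_once(theta, emax) applies the Bellman
    operator once (equations (4)-(5) and (7)) and returns a new Emax vector.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    hist_theta = [theta.copy()]              # stored candidate draws theta^{*l}
    hist_emax = [np.zeros(n_states)]         # initial Emax guess
    log_rho = pseudo_loglik(theta, hist_emax[0])
    draws = []
    for r in range(n_iter):
        # Step 2: random-walk candidate draw
        cand = theta + step * rng.standard_normal(theta.size)
        # Step 3: kernel-weighted Emax approximation at the candidate (eq. (6))
        w = kernel_weights(cand, hist_theta[-N:], h)
        emax_c = w @ np.asarray(hist_emax[-N:])
        log_rho_c = pseudo_loglik(cand, emax_c)
        # M-H acceptance, treating the pseudo-likelihood as the true likelihood
        if np.log(rng.uniform()) < (log_prior(cand) + log_rho_c
                                    - log_prior(theta) - log_rho):
            theta, log_rho = cand, log_rho_c
        # Step 4: one Bellman-operator application at the candidate; we store
        # candidates (not accepted draws) so stored points span the space
        hist_theta.append(cand)
        hist_emax.append(bellman_once(cand, emax_c))
        draws.append(theta.copy())
    return np.array(draws)
```

Plugging in the store choice model amounts to supplying `pseudo_loglik` built from the choice probabilities and `bellman_once` built from (4), (5) and (7).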
Note that when implementing the IJC algorithm, we propose to store
{θ^{*l}, E_p V^l(s, p; θ^{*l})}_{l=r−N}^{r−1} instead of {θ^l, E_p V^l(s, p; θ^l)}_{l=r−N}^{r−1}. If we
store {θ^l, E_p V^l(s, p; θ^l)}_{l=r−N}^{r−1}, a significant portion of the θ^l's may be repeated,
because the acceptance rate of the M-H step is usually set at around 1/3. To conduct the
non-parametric approximation of the expected future value, it is useful to have a set of
E_p V^l's that spans the parameter space. Since the θ^{*l}'s are drawn from a
candidate-generating function, it is much easier to achieve this goal by storing the
E_p V^l's at the θ^{*l}'s. Moreover, for each candidate parameter draw θ^{*r}, we need to
evaluate the expected future payoffs, E_p^r V(·, p; θ^{*r}), to form the likelihood. As we have
shown above, it only takes an extra step to obtain E_p V^r(·, p; θ^{*r}), so storing
{E_p V^l(·, p; θ^{*l})}_{l=r−N}^{r−1} imposes little extra cost.

[6] Two points should be noted here. First, in each iteration, there is a new pseudo-value function evaluated at θ^{*r}. Second, this step need not be carried out in a full-solution-based Bayesian MCMC algorithm.
The above description of the implementation is slightly different from IJC, who propose
to draw only one unobserved error term in each iteration. In the current application,
that would be equivalent to drawing one price shock p^l in each iteration and storing
H^r = { {θ^{*l}, V^l(·, p^l; θ^{*l})}_{l=r−N}^{r−1}, ρ^r(θ^{r−1}) }. The main difference between
these two approaches is that the original IJC propose to store V^l instead of E_p V^l, so
that the Emax approximation is

    E_p^r V(s, p; θ^{*r}) = Σ_{l=r−N}^{r−1} V^l(s, p^l; θ^{*l}) · K_h(θ^{*r} − θ^{*l}) / Σ_{k=r−N}^{r−1} K_h(θ^{*r} − θ^{*k}).

The advantage of the original approach is that it saves some time in computing E_p V^r.
However, we believe that the approach proposed here achieves the same level of precision
in integrating out prices with a smaller N. As a result, it also saves time in computing
the weighted average.
We should also note that, in the present example where prices are observed, one can use
the observed prices as random realizations in computing E_p V^r(s, p; θ^r), provided that
there is a sufficient number of observations for each s. The advantage of using the
observed prices is that the pseudo-value functions then become by-products of the
likelihood computation. However, this is only possible when we have a reasonably large
number of observations for each state.
We also note that, in step 5, it may not be worthwhile to compute the pseudo-likelihood
ρ^{r+1}(θ^r) in each iteration. If we accept θ^{*r}, then ρ^r(θ^{*r}), which is calculated in the
M-H step (i.e., step 3), can be used as a proxy for ρ^{r+1}(θ^r), which is computed in step 5;
note that their calculations differ only in one past pseudo-value function. If we reject θ^{*r},
we set θ^r = θ^{r−1}, so we can use ρ^r(θ^{r−1}) as a proxy for ρ^{r+1}(θ^r) and only compute
ρ^{r+1}(θ^r) once every several successive rejections. Our experience suggests that this
approach yields a fairly decent saving in computation time.
3.2.2 Heterogeneous Consumers
We now present the implementation of the IJC algorithm when consumers have heterogeneous
valuations of the reward (σ_Gj > 0). The vector of parameters to be estimated is
θ = (θ_1, θ_2), where θ_1 = (α_1, α_2, γ, β) and θ_2 = (G_1, G_2, σ_G1, σ_G2). We incorporate
the Bayesian approach for random-coefficient models into the estimation steps for the
homogeneous case. We use a normal prior on G_j and an inverted gamma prior on σ_Gj.

We use the Metropolis-Hastings algorithm to draw G_ij, consumer i's valuation of the
reward at store j, from the population distribution of G_ij. This draw is regarded as an
individual-specific parameter. As a result, conditional on G_i, the value functions do not
depend on θ_2.
Each MCMC iteration mainly consists of three blocks:

(i) Draw G_j^r ∼ f_G(G_j | σ_Gj^{r−1}, {G_ij^{r−1}}_{i=1}^I) and σ_Gj^r ∼ f_σG(σ_Gj | G_j^r, {G_ij^{r−1}}_{i=1}^I)
    for j = 1, 2 (the parameters that characterize the population distribution of G_ij),
    where f_G and f_σG are the posterior distributions.

(ii) Draw the individual parameters G_ij^r by the independent Metropolis-Hastings algorithm,
     with N(G_j^r, (σ_Gj^r)^2) as the proposal distribution, for all i and j = 1, 2.

(iii) Draw θ_1^r ∼ f_θ1(· | G_i^r) using the random-walk Metropolis-Hastings algorithm.
The estimation steps are as follows. Steps 2-3 belong to block (i), step 4 belongs to block (ii), and step 5 belongs to block (iii). For each MCMC iteration $r$:
1. Start with

$$H^r = \left\{\{\theta^{*l}, G^l_{i^*}, E_p V^l(\cdot, p; G^l_{i^*}, \theta^{*l}_1)\}_{l=r-N}^{r-1},\ \{\rho^r_i(G^{r-1}_i, \theta^{r-1}_1)\}_{i=1}^I\right\},$$

where $I$ is the number of consumers; $N$ is the number of past iterations used for the Emax approximation; and $i^* = r - I \cdot \mathrm{int}\!\left(\frac{r-1}{I}\right)$, where $\mathrm{int}(\cdot)$ is the integer function that converts any real number to an integer by discarding its value after the decimal place.
2. Draw $G^r_j$ (the population mean of $G_{ij}$) from the posterior density (normal) computed using $\sigma^{r-1}_{G_j}$ and $\{G^{r-1}_{ij}\}_{i=1}^I$.

3. Draw $\sigma^r_{G_j}$ (the population variance of $G_{ij}$) from the posterior density (inverted gamma) computed using $G^r_j$ and $\{G^{r-1}_{ij}\}_{i=1}^I$.
4. For each $i$, use the Metropolis-Hastings algorithm to draw $G^r_{ij}$.

(a) Use the prior on $G_{ij}$ (i.e., $N(G^r_j, (\sigma^r_{G_j})^2)$) as the proposal distribution to draw $G^{*r}_{ij}$.
(b) Compute the likelihood for consumer $i$ at $G^{*r}_{ij}$, i.e., $\rho^r_i(G^{*r}_{ij}, \theta^{r-1}_1)$, and determine whether or not to accept $G^{*r}_{ij}$. The acceptance probability, $\lambda$, is given by

$$\lambda = \min\left(\frac{\pi(G^{*r}_{ij}) \cdot \rho^r_i(G^{*r}_{ij}, \theta^{r-1}_1) \cdot q(G^{*r}_{ij}, G^{r-1}_{ij})}{\pi(G^{r-1}_{ij}) \cdot \rho^r_i(G^{r-1}_{ij}, \theta^{r-1}_1) \cdot q(G^{r-1}_{ij}, G^{*r}_{ij})},\ 1\right) = \min\left(\frac{\rho^r_i(G^{*r}_{ij}, \theta^{r-1}_1)}{\rho^r_i(G^{r-1}_{ij}, \theta^{r-1}_1)},\ 1\right),$$

where the second equality follows from the fact that the proposal distribution is set to the prior on $G_{ij}$, i.e., $q(x, y) = \pi(y)$. Let $G^r_i$ be the accepted draw. Note that $G^{*r}_{ij}$ will only affect $E^r_p V(s, p; G^{*r}_i, \theta^{r-1}_1)$ for consumer $i$ and not for other consumers. Thus we only need to approximate the Emax functions for consumer $i$, using the weighted average of $\{E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\}_{l=r-N}^{r-1}$ and treating $G_i$ as one of the parameters when computing the weights. In the case of independent kernels, the Emax approximation is

$$E^r_p V(s, p; G^{*r}_i, \theta^{r-1}_1) = \sum_{l=r-N}^{r-1} E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\ \frac{K_h(\theta^{r-1}_1 - \theta^{*l}_1)\, K_h(G^{*r}_i - G^l_{i^*})}{\sum_{k=r-N}^{r-1} K_h(\theta^{r-1}_1 - \theta^{*k}_1)\, K_h(G^{*r}_i - G^k_{i^*})}.$$

(Note that $\{K_h(\theta^{r-1}_1 - \theta^{*l}_1)\}_{l=r-N}^{r-1}$ is common across consumers; therefore, one can calculate it outside the consumer loop when programming this part.)

(c) Repeat for all $i$.

5. Use the Metropolis-Hastings algorithm to draw $\theta^r_1 = (\alpha^r_1, \alpha^r_2, \gamma^r, \beta^r)$ conditional on $G^r_{ij}$.

(a) Draw a candidate parameter vector, $\theta^{*r}_1 = (\alpha^{*r}_1, \alpha^{*r}_2, \gamma^{*r}, \beta^{*r})$.

(b) Compute the likelihood conditional on $(\alpha^{*r}_1, \alpha^{*r}_2, \gamma^{*r}, \beta^{*r})$ and $\{G^r_i\}_{i=1}^I$, based on the approximated Emax functions, and determine whether or not to accept $\alpha^{*r}_1$, $\alpha^{*r}_2$, $\gamma^{*r}$, and $\beta^{*r}$. The Emax approximation is described as follows.
For each $i$ and $s$, $E^r_p V(s, p; G^r_i, \theta^{*r}_1)$ is obtained using the weighted average of the past value functions, $\{E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\}_{l=r-N}^{r-1}$. In computing the weights for the past value functions, we treat $G_i$ as a parameter. In the case of independent kernels, equation (6) becomes

$$E^r_p V(s, p; G^r_i, \theta^{*r}_1) = \sum_{l=r-N}^{r-1} E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\ \frac{K_h(\theta^{*r}_1 - \theta^{*l}_1)\, K_h(G^r_i - G^l_{i^*})}{\sum_{k=r-N}^{r-1} K_h(\theta^{*r}_1 - \theta^{*k}_1)\, K_h(G^r_i - G^k_{i^*})}.$$

(Note that $\{K_h(\theta^{*r}_1 - \theta^{*l}_1)\}_{l=r-N}^{r-1}$ is common across consumers; therefore, one can compute it outside the loop that indexes consumers to save computational time.)
6. Computation of the pseudo-value function, $E_p V^r(s, p; G^r_{i^*}, \theta^{*r}_1)$:

(a) Make $M$ draws of prices, $p^m$, from the price distribution.

(b) Compute $V^r_0(s; \theta^{*r}_1)$, $V^r_1(s, p^m_1; G^r_{i^*}, \theta^{*r}_1)$ and $V^r_2(s, p^m_2; G^r_{i^*}, \theta^{*r}_1)$, using the approximated Emax functions computed in step 5(c).

(c) Given $V^r_0(s; \theta^{*r}_1)$, $V^r_1(s, p^m_1; G^r_{i^*}, \theta^{*r}_1)$ and $V^r_2(s, p^m_2; G^r_{i^*}, \theta^{*r}_1)$, obtain the pseudo-value function, $V^r(s, p^m; G^r_{i^*}, \theta^{*r}_1)$. By averaging $V^r(s, p^m; G^r_{i^*}, \theta^{*r}_1)$ across the $p^m$'s, we integrate out prices and obtain $E_p V^r(s, p; G^r_{i^*}, \theta^{*r}_1)$.

7. Compute the pseudo-likelihood, $\rho^{r+1}_i(G^r_i, \theta^r_1)$ for all $i$, based on an updated set of past pseudo-value functions.

8. Go to iteration $r + 1$.
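As a rough sketch of the approximation used in steps 4(b) and 5(b), the kernel-weighted average over stored pseudo-value functions might look as follows. This is a hypothetical container layout with a Gaussian product kernel, not the authors' implementation:

```python
import math

def gaussian_kernel(diffs, h):
    """Product Gaussian kernel K_h over a vector of componentwise differences."""
    return math.exp(-0.5 * sum((d / h) ** 2 for d in diffs))

def approx_emax(s, theta1, g_i, history, h):
    """Kernel-weighted average of past pseudo-value functions, treating the
    individual draw g_i as one of the parameters when computing the weights.
    `history` holds (theta1_l, g_l, emax_l) tuples, with emax_l mapping states
    to stored Emax values (a hypothetical layout)."""
    num = den = 0.0
    for theta1_l, g_l, emax_l in history:
        w = gaussian_kernel([a - b for a, b in zip(theta1, theta1_l)], h) \
            * gaussian_kernel([g_i - g_l], h)
        num += w * emax_l[s]
        den += w
    return num / den
```

Past iterations whose parameter and heterogeneity draws lie close to the evaluation point dominate the weighted average, which is the essence of the IJC approximation.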
In step 1 of the procedure described above, we pick one consumer in each iteration and store his/her pseudo-value function. Then, we use this pooled set of past pseudo-value functions across consumers to approximate the Emax functions for all consumers. This is slightly different from IJC, who originally propose to store an individual-specific set of past pseudo-value functions for each consumer. That is, in each iteration we store $H^r = \{\theta^{*l}, \{G^l_i, E_p V^l(\cdot, p; G^l_i, \theta^{*l})\}_{i=1}^I\}_{l=r-N}^{r-1}$, and use $\{E_p V^l(\cdot, p; G^l_i, \theta^{*l})\}_{l=r-N}^{r-1}$ to approximate consumer $i$'s Emax function. One advantage of this approach is that the past pseudo-value functions used in the Emax function approximation are more relevant to each consumer $i$, because they are evaluated at the $G^l_i$'s, which represent the posterior distribution of consumer $i$'s value for the gift, and should be closer to $G^{*r}_i$. This is not the case when we pool past pseudo-value functions across consumers, because different consumers may have very different values of $G_i$. This suggests that if we store past pseudo-value functions individually, we may be able to reduce $N$ while achieving the same level of precision for the Emax approximation, which in turn should reduce the computation time. One drawback is that we need much more memory to store past pseudo-value functions individually, although this may not be an important concern given that the price of computer memory has been decreasing rapidly over time.
We describe steps 2 and 3 only briefly here; for the details of these two steps, we refer readers to Train (2003). When implementing step 5, it could be more efficient to separate the parameters into blocks if the acceptance rate is low. The trade-off is that implementing this step in blocks may also increase the number of expected future value approximations and likelihood evaluations.
3.3 Choice of kernel’s bandwidth and N
The IJC method relies on classical non-parametric methods to approximate the Emax functions using the past pseudo-value functions generated by the algorithm. One practical problem of nonparametric regression analysis is that the data become increasingly sparse as the dimensionality of the explanatory variables ($x$) increases. For instance, ten points that are uniformly distributed in the unit cube are more scattered than ten points distributed uniformly in the unit interval. Thus the number of observations available to provide information about the local behavior of an arbitrary regression function becomes small when the dimension is large. The curse of dimensionality of this non-parametric technique (in terms of the number of parameters) could therefore be a concern. The root of this problem is the bias-variance trade-off. In general, when the kernel bandwidth is small, the effective number of sample points around $x$ that influence the prediction is small, making the prediction highly sensitive to that particular sample, i.e., yielding high variance. When the kernel bandwidth is large, the prediction becomes overly smooth, i.e., yielding high bias.
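The trade-off is easy to see in a toy Nadaraya-Watson kernel regression (a generic illustration, unrelated to the store choice model): a tiny bandwidth nearly interpolates the noisy sample, while a huge bandwidth collapses toward the global mean.

```python
import math, random

def nw_predict(x0, xs, ys, h):
    """Nadaraya-Watson kernel regression prediction at x0 with bandwidth h."""
    ws = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

random.seed(1)
xs = [i / 20 for i in range(21)]                   # grid on [0, 1]
ys = [x ** 2 + random.gauss(0, 0.05) for x in xs]  # noisy quadratic
tight = nw_predict(0.5, xs, ys, 0.02)   # small h: tracks the noise (high variance)
smooth = nw_predict(0.5, xs, ys, 0.5)   # large h: over-smoothed (high bias)
```

In the limits, a vanishing bandwidth returns the nearest sample point exactly, while an enormous bandwidth returns the sample mean of the $y$'s.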
However, in implementing the IJC algorithm, the nature of this problem differs from standard non-parametric estimation. Unlike a standard estimation problem, where an econometrician cannot control the sample size of the data set, we can control the sample size for our nonparametric regressions by storing/using more past pseudo-value functions (i.e., increasing $N$). This is similar to the advantage of using the standard MCMC method to draw from the posterior distribution: the econometrician can control the number of iterations required to obtain convergence. Thus in practice, we expect that $N$ may need to increase with the number of parameters in the model. As a result, it would also take more time to compute one iteration.
The discussion above suggests that the convergence rate is typically inversely related to the number of dimensions. (Note that this curse of dimensionality problem is different from that of solving a dynamic programming model, where it refers to the size of the state space increasing exponentially with the number of state variables and the number of values for each state variable.) But the situation that we face here is more subtle, for two reasons. First, it is likely that the convergence rate is model specific, as the shape of the likelihood function is also model specific. Second, it should also depend on the sample size of the data. In general, when estimating a well-identified model using a data set with sufficient variation, the posterior variance of the parameters decreases with the sample size. This suggests that once the MCMC converges, the simulated parameter values will move within a small neighborhood of the posterior means. Consequently, the past pseudo-value functions will be evaluated at parameter vectors that are concentrated in a small neighborhood. We expect that this should alleviate the curse of dimensionality problem.
It is worth discussing the impact of $N$ on the estimation results. One implication is that as we increase $N$, older past pseudo-value functions will be used in computing the approximated Emax functions. This may result in slow improvements of the approximated Emax values, and may slow down the speed of MCMC convergence. As we decrease $N$, only more recent and accurate past pseudo-value functions will be used in the Emax approximation. However, since the number of past pseudo-value functions itself becomes smaller, the variance of the approximated Emax values will increase. This may result in a higher standard deviation of the posterior distribution for some parameters. One way of mitigating this trade-off is to set $N$ to be small at the beginning of the IJC algorithm and let $N$ increase during the MCMC iterations. In this way, we can achieve faster convergence and more stable posterior distributions at the same time. Another way to address this issue is to weight the past $N$ pseudo-value functions differently, so that the more recent pseudo-value functions receive higher weights (because they should be more accurate approximations). In one Monte Carlo experiment in the next section, we show some evidence about the impact of $N$ on the estimation results.
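The recency-weighting idea can be sketched by multiplying each kernel weight by a geometric factor in the iteration lag; the `decay` constant here is our own illustrative choice, not a value from the paper:

```python
import math

def recency_kernel_weights(theta, past_thetas, h, decay=0.99):
    """Normalized weights over past draws (ordered oldest-first), combining a
    Gaussian kernel in the parameter with a geometric down-weighting of older
    iterations, so more recent pseudo-value functions receive higher weight."""
    n = len(past_thetas)
    raw = [(decay ** (n - 1 - l)) * math.exp(-0.5 * ((theta - t) / h) ** 2)
           for l, t in enumerate(past_thetas)]
    total = sum(raw)
    return [w / total for w in raw]
```

With equal kernel distances, the most recent pseudo-value function receives the largest weight, and the weights still sum to one.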
An obvious question that will likely come to researchers' minds is: how do we choose $N$ and the bandwidth ($h$)? We believe that any suggested guideline should ensure that the pseudo-value function gives us a good proxy for the true value function. We suggest that researchers check the distance between the pseudo-value function and the true value function during the estimation, and adjust $N$ and $h$ within the iterative process. For instance, researchers can store a large set of past pseudo-value functions (i.e., large $N$), and use only the most recent $N' < N$ of them to do the approximation. This has the advantage that researchers can immediately increase $N'$ if they discover that the approximation is not good enough. Researchers can start the algorithm with a small $N'$ (say $N' = 100$) and an arbitrary bandwidth (say 0.01). Every 1000 iterations, they can compute the means of the MCMC draws, $\bar\theta$, and then compare the distance between the pseudo-value function and the exact value function at $\bar\theta$. If the distance is larger than what the researcher is willing to accept, increase $N'$. Then use the $N'$ past pseudo-value functions to compute summary statistics, and use a standard optimal bandwidth formula (e.g., Silverman's rule of thumb; Silverman 1986, p. 48) to set $h$. Of course, the cost of storing a large number of past pseudo-value functions is that it requires more memory. But again, the cost of memory has been decreasing rapidly over time, so we expect that memory will become less of a constraint in the near future. This suggestion requires solving the DDP model exactly once every 1000 iterations. For complicated DDP models with random coefficients, this could still be computationally costly. But even in this case, one could simply compare the pseudo-value function and the exact value function at a small number of simulated heterogeneous parameter vectors, say 5. This would be equivalent to solving 5 homogeneous DDP models numerically, and should be feasible even for complicated DDP models.
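The bandwidth step can be sketched with one common form of Silverman's rule of thumb, $h = 1.06\,\hat{\sigma}\,n^{-1/5}$, applied to one dimension of the stored draws (Silverman 1986 also gives a more robust variant based on the interquartile range):

```python
import statistics

def silverman_bandwidth(draws):
    """Gaussian rule-of-thumb bandwidth h = 1.06 * sigma_hat * n^(-1/5),
    computed from one parameter's stored MCMC draws."""
    n = len(draws)
    return 1.06 * statistics.stdev(draws) * n ** (-0.2)
```

In the IJC setting, `draws` would be the most recent $N'$ values of one parameter; the bandwidth shrinks slowly as more pseudo-value functions are stored.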
4 Estimation Results
To illustrate how to implement the IJC algorithm and investigate some of its properties,
we conduct three Monte Carlo experiments. For each experiment, the simulated sample
size is 1,000 consumers and 100 periods. We use the Gaussian kernel to weigh the past
pseudo-value functions when approximating the Emax functions. The total number of
MCMC iterations is 10,000, and we report the posterior distributions of parameters based
on the 5,001-10,000 iterations. The sample size is 1,000 consumers for 100 periods. For
all experiments, the following parameters are fixed: S1, S2, p = 1.0, and σp = 0.3.
In the first experiment, we estimate a version of the model without unobserved heterogeneity. When simulating the data, we set $S_1 = 2$, $S_2 = 4$, $\sigma_{G_1} = \sigma_{G_2} = \alpha_1 = \alpha_2 = 0$, $G_1 = 1.0$, $G_2 = 5.0$, $\gamma = -1.0$, and $\beta = 0.6$ or $0.8$. Our goal is to estimate $\alpha_1$, $\alpha_2$, $G_1$, $G_2$, $\gamma$, and $\beta$, treating the other parameters as known. To ensure that $\beta < 1$, we transform it as $\beta = \frac{1}{1+\exp(\phi)}$. For all parameters, a flat prior is used. Moreover, we use a random-walk proposal function. Table 1 summarizes the estimation results, and Figure 3 plots the simulated draws of the parameters for the case of $\beta = 0.8$. The posterior means and standard deviations show that the IJC algorithm is able to recover the true parameter values well. Moreover, it appears that the MCMC draws converge after 2,000 iterations.
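The reparameterization of $\beta$ used above, together with its inverse for initializing $\phi$ at a target value, can be sketched as:

```python
import math

def beta_from_phi(phi):
    """Map an unconstrained phi into beta = 1 / (1 + exp(phi)), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(phi))

def phi_from_beta(beta):
    """Inverse transform: phi = log(1/beta - 1)."""
    return math.log(1.0 / beta - 1.0)
```

Sampling $\phi$ on the whole real line and transforming it guarantees every proposed discount factor stays strictly inside the unit interval.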
In the second experiment, we estimate a version of the model with unobserved heterogeneity. For simplicity, we only allow for consumer heterogeneity in $G_2$ (i.e., $\sigma_{G_1} = 0$). The data are simulated based on the following parameter values: $\alpha_1 = \alpha_2 = 0.0$, $G_1 = 1.0$, $G_2 = 5.0$, $\sigma_{G_1} = 0.0$, $\sigma_{G_2} = 1.0$, $\gamma$, and $\beta = 0.6$ or $0.8$. Again, we transform $\beta$ by the logit formula, i.e., $\beta = \frac{1}{1+\exp(\phi)}$. Our goal is to estimate $\alpha_1$, $\alpha_2$, $G_1$, $G_2$, $\sigma_{G_2}$, $\gamma$, and $\beta$, treating the other parameters as known. For $\alpha_1$, $\alpha_2$, $G_1$, $\gamma$, and $\phi$, we use flat priors. For $G_2$, we use a diffuse normal prior (i.e., setting the standard deviation of the prior to $\infty$). For $\sigma_{G_2}$, we use a diffuse inverted gamma prior, $IG(\nu_0, s_0)$ (i.e., setting $s_0 = 1$, $\nu_0 \to 1$). Table 2 shows the estimation results, and Figure 4 plots the simulated draws of the parameters for $\beta = 0.8$. The IJC algorithm is again able to recover the true parameter values well. The MCMC draws appear to converge after 2,000 iterations for most of the parameters, except $G_1$, which takes about 3,000 iterations to achieve convergence.
To learn more about the potential gain of IJC in terms of computational time, we compute the time per iteration and compare IJC's Bayesian MCMC algorithm with the full-solution-based Bayesian MCMC algorithm, for both the homogeneous and the heterogeneous models. In the full-solution-based Bayesian algorithm, we use 100 simulated draws of prices to integrate out the future price. For each model, we study three cases: $\beta = 0.6$, $0.8$, and $0.98$. Table 3 summarizes the results based on the average computation time. The computation time is based on a C program running on a Linux workstation with an Intel Core 2 Duo E4400 2GHz processor. Note that in the full-solution-based Bayesian algorithm, the computation time increases as the discount factor becomes larger. This is because the number of steps required for convergence of a contraction mapping increases with the discount factor (i.e., the modulus). In the IJC algorithm, however, the computation time is not influenced by the discount factor. In the homogeneous model, the full-solution-based Bayesian algorithm is faster for $\beta = 0.6$ and $0.8$. This is because (i) when $\beta$ is small, solving a contraction mapping to get the exact Emax values is not that costly compared with computing the weighted Emax values based on 1,000 past pseudo-value functions, and (ii) the full-solution-based Bayesian approach does not need to perform step 5 in the homogeneous case and step 7 in the heterogeneous case (in this exercise, we perform step 5 in the homogeneous case and step 7 in the heterogeneous case every time a candidate parameter vector is rejected). However, when $\beta = 0.98$, the IJC algorithm is 40% faster than the full-solution algorithm. In the heterogeneous model, the advantage of the IJC algorithm is much clearer: when $\beta = 0.6$, the IJC algorithm is 50% faster than the full-solution-based Bayesian algorithm; when $\beta = 0.8$, it is about 200% faster; and when $\beta = 0.98$, it is about 3000% faster. In particular, the average computational time per iteration basically remains unchanged in the IJC algorithm. For the full-solution-based method, the computational time per iteration increases sharply with $\beta$ because, roughly speaking, we need to solve the DDP model for each individual. If there are 1,000 individuals, the computational time will then be roughly (time per contraction mapping) × 1,000. For the heterogeneous model with $\beta = 0.98$, it would take about 70 days to run the full-solution-based Bayesian MCMC algorithm for 10,000 iterations (depending on the convergence rate, the number of iterations required for Bayesian estimation could be higher than 10,000). Using IJC's Bayesian MCMC algorithm, it takes less than 2.5 days to obtain 10,000 iterations.
As discussed above, one issue in using the IJC algorithm is how to choose $N$, the number of past pseudo-value functions. In the third Monte Carlo experiment, using the homogeneous model with $\beta = 0.98$, we investigate how changes in $N$ influence the speed of convergence and the posterior distributions. We use a high discount factor here because, as discussed in section 3.3, when the discount factor is large, the number of past pseudo-value functions used for the Emax function approximation becomes more important; changes in $N$ thus have a larger impact on the speed of convergence and the posterior distributions than when the discount factor is small. We simulate the data given the following set of parameter values: $S_1 = 5$, $S_2 = 10$, $\alpha_1 = \alpha_2 = 0$, $G_1 = 1$, $G_2 = 10$, $\gamma = -1$, $p = 1.0$, and $\sigma_p = 0.3$. Our goal is to compare the performance of the IJC algorithm using $N = 100$ and $N = 1000$. Table 4 shows the posterior distributions of the parameters. The results show that the posterior means are very similar in both cases, but the standard deviations for $G_1$ and $G_2$ are smaller for $N = 1000$. This is consistent with our argument in section 3.3: when more pseudo-value functions are used in the approximation, the variance of the approximation becomes smaller. To see how the speed of convergence changes with $N$, we plot the MCMC samplers for $\alpha_1$ and $\alpha_2$ in Figure 5, and for $G_1$ and $G_2$ in Figure 6. When $N = 100$, the speed of convergence is faster, but the paths also fluctuate more. Again, this is consistent with our discussion in section 3.3.
Note that when $\beta = 0.98$, the true parameter values are recovered less precisely, in particular $\alpha_j$ and $G_j$. This is due to the identification problem that we discussed before: when the discount factor is large, it does not matter much when a consumer receives the gift. As a result, $G_j$ simply shifts the choice probabilities, similar to the way that $\alpha_j$ does.
We now turn to discuss how to extend the IJC algorithm to (i) conduct policy experiments, and (ii) allow for a continuous state space. We will also comment on the choice of kernels.
5 Extensions

5.1 Conducting Policy Experiments
The output of the IJC algorithm is the posterior distribution of the model parameters, along with a set of value function (and Emax function) estimates associated with each parameter vector. What if we are interested in a policy experiment that involves changing a policy parameter by a certain percentage (e.g., increasing the cost of entry by $100t$ percent), such that the new parameter vectors do not belong to the support of the posterior distribution? It would appear that the IJC algorithm does not provide solutions of the dynamic programming problem evaluated at those policy parameter vectors. In fact, this limitation applies even if one uses the full-solution-based Bayesian MCMC algorithm.
Here we propose a minor modification of the IJC algorithm so that we can obtain the value functions at the new policy parameters as part of the estimation output as well. Suppose that the researcher is interested in the effect of changing $\theta_i$ to $\tilde\theta_i$, where $\tilde\theta_i = (1+t)\theta_i$ and $\theta_i$ is the $i$th element of the parameter vector $\theta$. The modified procedure needs to store the following additional information: $\{\tilde\theta^{*l}, E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=r-N}^{r-1}$, where $\tilde\theta^{*l}_i = (1+t)\theta^{*l}_i$ and $\tilde\theta^{*l}_{-i} = \theta^{*l}_{-i}$.

Once the MCMC samplers converge, we will have $\{\theta^l\}_{l=1}^L$ as well as $\{\tilde\theta^{*l}, E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=L-N+1}^L$ as the outputs, where $L$ is the total number of MCMC iterations. To do the policy experiment, (i) take the last $M$ draws of $\theta^r$, and set the draws of the policy parameter vector as follows: $\tilde\theta^r_{-i} = \theta^r_{-i}$ and $\tilde\theta^r_i = (1+t)\theta^r_i$; (ii) use $\{E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=L-N+1}^L$ to form an Emax approximation at $\tilde\theta^r$, and then obtain the value function and the choice probabilities at $\tilde\theta^r$ for each $r$.
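Step (ii) — forming the Emax at the scaled policy parameter from the stored pairs — can be sketched as follows (hypothetical storage layout; a Gaussian product kernel stands in for $K_h$):

```python
import math

def policy_emax(s, theta_r, stored, h, i, t):
    """Approximate the Emax at theta_tilde, where element i of theta_r is
    scaled by (1 + t), using stored (theta_tilde_l, emax_l) pairs kept
    alongside the regular IJC output (a hypothetical layout)."""
    theta_tilde = list(theta_r)
    theta_tilde[i] *= (1.0 + t)
    num = den = 0.0
    for theta_l, emax_l in stored:
        w = math.exp(-0.5 * sum(((a - b) / h) ** 2
                                for a, b in zip(theta_tilde, theta_l)))
        num += w * emax_l[s]
        den += w
    return num / den
```

Because the stored grid was itself built at scaled parameters $\tilde\theta^{*l}$, the kernel average is evaluated on the support where the policy counterfactual lives.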
This procedure increases the computational burden of each iteration due to the calculation of the approximated Emax at $\tilde\theta^{*r}$. However, it is relatively straightforward to implement and requires very little extra programming effort. To obtain some insight into how much more time it takes to include the results for a policy experiment, we break down the computation time of each iteration into four components, based on our model with unobserved heterogeneity: (i) the Emax approximation at $\theta^{*r}$; (ii) the likelihood evaluation at $\theta^{*r}$; (iii) the Emax approximation at $\theta^r$ based on an updated set of pseudo-value functions; (iv) the likelihood evaluation at $\theta^r$ based on an updated set of pseudo-value functions. The results are shown in Table 5. Notice that to conduct the policy experiment, we only need to compute the Emax approximation at $\tilde\theta^{*r}$ and store $\{E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=r-N}^{r-1}$. The steps used to calculate the Emax functions at $\tilde\theta^{*r}$ are the same as those used to calculate the Emax functions at $\theta^{*r}$. Thus the additional computational time will be about the same as that for step (iii) above, which constitutes about 40-60% of the computation time per iteration. This indicates that conducting a policy experiment with the IJC algorithm would roughly increase the computational time by 40-60%.
Finally, we note that there is a limitation of this procedure: we need to know the
magnitude of the change in the policy parameter before seeing the estimation results.
Sometimes researchers may not be able to determine this until they obtain the parameter
estimates.
5.2 Continuous State Space
The state space of the dynamic store choice model described earlier is the number of
stamps collected, which takes a finite number of values. In many marketing and eco-
nomics applications, however, we have to deal with continuous state variables such as
prices, advertising expenditures, capital stocks, etc. IJC also describes how to extend the
algorithm to combine with the random grid approximation proposed by Rust (1997). To
illustrate how it works, we consider the homogeneous model here.
Consider a modified version of the dynamic store choice model without unobserved consumer heterogeneity. Suppose that the prices set by the two supermarkets follow a first-order Markov process (instead of a process that is iid across time): $f(p'|p; \theta_p)$, where $\theta_p$ is the vector of parameters of the price process. In this setting, the expected value functions in equation (5) are conditional on current prices, $E_{p'}[V(s', p'; \theta)|p]$. In the Rust random grid approximation, we evaluate this expected value function as follows. We randomly sample $M$ grid points, $p^m = (p^m_1, p^m_2)$ for $m = 1, \ldots, M$. Then we evaluate the value functions at each $p^m$ and compute the weighted average of the value functions, where the weights are given by the conditional price distribution.
For each iteration $r$, we can make one draw of prices, $p^r = (p^r_1, p^r_2)$, from a distribution. For example, we can define this distribution as uniform on $[\underline{p}, \bar{p}]^2$, where $\underline{p}$ and $\bar{p}$ are the lowest and highest observed prices, respectively. Then, we compute the pseudo-value function at $p^r$, $V^r(s, p^r; \theta^r)$, for all $s$. Thus, $H^r$ in step 1 of section 3.2.1 needs to be changed to

$$H^r = \{\{\theta^{*l}, p^l, V^l(\cdot, p^l; \theta^{*l})\}_{l=r-N}^{r-1},\ \rho^{r-1}(\theta^{r-1})\}.$$

The expected value function given $s'$, $p$, and $\theta^r$ is then approximated as

$$E^r_{p'}[V(s', p'; \theta^r)|p] = \sum_{l=r-N}^{r-1} V^l(s', p^l; \theta^{*l})\ \frac{K_h(\theta^r - \theta^{*l})\, f(p^l|p; \theta_p)}{\sum_{k=r-N}^{r-1} K_h(\theta^r - \theta^{*k})\, f(p^k|p; \theta_p)}. \qquad (8)$$
Unlike the Rust random grid approximation, which fixes the number of grid points throughout the estimation, the random grid points here change at each MCMC iteration. In addition, the total number of random grid points can be made arbitrarily large by increasing $N$.
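Equation (8) can be sketched as follows; `history` is a hypothetical list of stored $(\theta^{*l}, p^l, V^l)$ triples, and `f_price` stands in for the Markov price density $f(p'|p; \theta_p)$:

```python
import math

def expected_value_markov(s_next, p, theta_r, history, h, f_price):
    """Approximate E_{p'}[V(s', p'; theta_r) | p] as a weighted average of past
    pseudo-value functions, with weights combining a Gaussian kernel in the
    parameters and the conditional price density f(p_l | p)."""
    num = den = 0.0
    for theta_l, p_l, v_l in history:
        w = math.exp(-0.5 * sum(((a - b) / h) ** 2
                                for a, b in zip(theta_r, theta_l))) * f_price(p_l, p)
        num += w * v_l[s_next]
        den += w
    return num / den
```

The price density factor does the work of the random grid: past draws that are likely successors of the current price receive large weights, so no separate interpolation over a fixed price grid is needed.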
The procedure for obtaining the pseudo-value function in step 4 of section 3.2.1 needs to be modified slightly. We store $V^r(s, p^r; \theta^{*r})$ instead of $E_p V^r(s, p; \theta^{*r})$. Specifically, the pseudo-value function at $p^r$ (and $\theta^{*r}$) is computed as follows. For each $s$,

$$V^r(s, p^r; \theta^{*r}) = \log\left(\exp(V^r_0(s; \theta^{*r})) + \exp(V^r_1(s, p^r_1; \theta^{*r})) + \exp(V^r_2(s, p^r_2; \theta^{*r}))\right),$$

where

$$V^r_j(s, p^r_j; \theta^{*r}) = \begin{cases} \alpha_j - \gamma p^r_j + G_j + \beta E^r_{p'}[V(s', p'; \theta^{*r})|p^r] & \text{if } s_j = S_j - 1, \\ \alpha_j - \gamma p^r_j + \beta E^r_{p'}[V(s', p'; \theta^{*r})|p^r] & \text{otherwise}, \end{cases}$$

$$V^r_0(s; \theta^{*r}) = \beta E^r_{p'}[V(s', p'; \theta^{*r})|p^r].$$
The approximated Emax functions above are computed by equation (8).
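The log-sum-of-exponentials in the pseudo-value function above is usually computed with the maximum subtracted to avoid numerical overflow; a generic sketch:

```python
import math

def pseudo_value(v0, v1, v2):
    """Closed-form Emax over the three alternatives under extreme-value errors:
    log(exp(v0) + exp(v1) + exp(v2)), stabilized by factoring out the max."""
    m = max(v0, v1, v2)
    return m + math.log(math.exp(v0 - m) + math.exp(v1 - m) + math.exp(v2 - m))
```

Without the max-subtraction, a large alternative-specific value would overflow `exp`; with it, the largest exponent is always zero.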
Note that if we simply apply the Rust random grid approximation with M grid points
in the IJC algorithm, we need to compute the pseudo-value functions at M grid points
in each iteration. Also, the integration with respect to prices requires us to first compute
the approximated value function at each grid point and then take the weighted average
of the approximated value functions. The computational advantage of the IJC random
grid algorithm described above comes from the fact that we only need to compute the
pseudo-value function at one grid point, pr, in each iteration, and the integration with
respect to prices can be done without approximating the value functions at a set of grid
points separately.
5.3 Choice of Kernels
It should be noted that there are many kernels that one could use in forming a non-
parametric approximation for the Emax function. IJC discuss their method in terms of the
Gaussian kernel. Norets (2008) extends IJC's method by approximating the Emax function using the past value functions evaluated at the "nearest neighbors," and by allowing the error
terms to be serially correlated. At this point, the relative performances of different kernels
in this setting are still largely unknown. It is possible that for models with certain features,
the Gaussian kernel performs better than other kernels in approximating the pseudo-value
function, while other kernels may outperform the Gaussian kernel for models with other
features. More research is needed to document the pros and cons of different kernels, and
provide guidance in the choice of kernel when implementing the IJC method.
6 Conclusion
In this paper, we discuss how to implement the IJC method using a dynamic store choice model. For illustration purposes, the specification of the model is relatively simple. We believe that this new method is quite promising for estimating DDP models. Osborne (2007) has successfully applied this method to estimate a much more detailed consumer learning model. The IJC method allows him to incorporate much more general unobserved consumer heterogeneity than the previous literature, and to draw inferences on the relative importance of switching costs, consumer learning, and consumer heterogeneity in explaining the persistent purchase behavior of customers observed in scanner panel data. Ching et al. (2009) have also successfully estimated a learning and forgetting model in which consumers are forward-looking.
Bayesian inference has allowed researchers and practitioners to develop more realistic
static choice models in the last two decades. It is our hope that the new method presented
here and its extensions would allow us to take another step to develop more realistic
dynamic choice models and ease the burden of estimating them in the near future.
References
Ackerberg, Daniel A. 2001. A New Use of Importance Sampling to Reduce Computational
Burden in Simulation Estimation. Working paper, Department of Economics, UCLA.
Ackerberg, Daniel A. 2003. Advertising, Learning, and Consumer Choice in Experience
Good Markets: An Empirical Examination. International Economic Review 44(3)
1007–1040.
Aguirregabiria, Victor, Pedro Mira. 2002. Swapping the Nested Fixed Point Algorithm:
A Class of Estimators for Discrete Markov Decision Models. Econometrica 70(4) 1519–
1543.
Albert, James H., Siddhartha Chib. 1993. Bayesian Analysis of Binary and Polychotomous
Response Data. Journal of the American Statistical Association 88 669–679.
Allenby, Greg M. 1994. An Introduction to Hierarchical Bayesian Modeling. Tutorial
Notes, Advanced Research Techniques Forum, American Marketing Association.
Allenby, Greg M., Peter J. Lenk. 1994. Modeling Household Purchase Behavior with
Logistic Normal Regression. Journal of the American Statistical Association 89 1218–
1231.
Brown, Meta, Christopher J. Flinn. 2006. Investment in Child Quality Over Marital
States. Working paper, Department of Economics, New York University.
Cerny, V. 1985. Thermodynamical Approach to the Travelling Salesman Problem: An
Efficient Simulation Algorithm. Journal of Optimization Theory and Applications 45(1)
41–51.
Crawford, Gregory S., Matthew Shum. 2005. Uncertainty and Learning in Pharmaceutical
Demand. Econometrica 73(4) 1137–1174.
Diermeier, Daniel, Michael P. Keane, Antonio M. Merlo. 2005. A Political Economy Model
of Congressional Careers. American Economic Review 95 347–373.
Erdem, Tulin, Susumu Imai, Michael P. Keane. 2003. Brand and Quality Choice Dynamics
under Price Uncertainty. Quantitative Marketing and Economics 1(1) 5–64.
Erdem, Tulin, Michael P. Keane. 1996. Decision Making under Uncertainty: Capturing
Dynamic Brand Choice Processes in Turbulent Consumer Goods Markets. Marketing
Science 15(1) 1–20.
Geweke, John F., Michael P. Keane. 2002. Bayesian Inference for Dynamic Discrete
Choice Models Without the Need for Dynamic Programming. Mariano, Schuermann,
Weeks, eds., Simulation Based Inference and Econometrics: Methods and Applications .
Cambridge University Press, Cambridge, UK.
Gonul, Fusun, Kannan Srinivasan. 1996. Estimating the Impact of Consumer Expec-
tations of Coupons on Purchase Behavior: A Dynamic Structural Model. Marketing
Science 15(3) 262–279.
Hartmann, Wesley R., V. Brian Viard. 2008. Do Frequency Reward Programs Create
Switching Costs? A Dynamic Structural Analysis of Demand in a Reward Program.
Quantitative Marketing and Economics 6(2) 109–137.
Hendel, Igal, Aviv Nevo. 2006. Measuring the Implications of Sales and Consumer Inven-
tory Behavior. Econometrica 74(6) 1637–1673.
Hitsch, Gunter. 2006. An Empirical Model of Optimal Dynamic Product Launch and
Exit Under Demand Uncertainty. Marketing Science 25(1) 25–50.
Hotz, Joseph V., Robert Miller. 1993. Conditional Choice Probabilities and the Estimation
of Dynamic Models. Review of Economic Studies 60(3) 497–529.
Imai, Susumu, Neelam Jain, Andrew Ching. 2008. Bayesian Estimation of Dynamic
Discrete Choice Models. Conditionally accepted at Econometrica. Available at SSRN:
http://ssrn.com/abstract=1118130.
Imai, Susumu, Kala Krishna. 2004. Employment, Deterrence and Crime in a Dynamic
Model. International Economic Review 45(3) 845–872.
Keane, Michael P., Kenneth I. Wolpin. 1994. The Solution and Estimation of Discrete
Choice Dynamic Programming Models by Simulation and Interpolation: Monte Carlo
Evidence. Review of Economics and Statistics 74(4) 648–672.
Keane, Michael P., Kenneth I. Wolpin. 1997. The Career Decisions of Young Men. Journal
of Political Economy 105 473–521.
Kirkpatrick, S., C.D. Gelatt, M.P. Vecchi. 1983. Optimization by Simulated Annealing.
Science 220 671–680.
Lancaster, Tony. 1997. Exact Structural Inference in Optimal Job Search Models. Journal
of Business and Economic Statistics 15(2) 165–179.
McCulloch, Robert, Peter E. Rossi. 1994. An Exact Likelihood Analysis of the Multino-
mial Probit Model. Journal of Econometrics 64 207–240.
Norets, Andriy. 2008. Inference in Dynamic Discrete Choice Models with Serially Corre-
lated Unobserved State Variables.
Osborne, Matthew. 2007. Consumer Learning, Switching Costs, and Heterogeneity: A
Structural Examination. Working paper, U.S. Department of Justice.
Rossi, Peter E., Greg M. Allenby. 1999. Marketing Models of Consumer Heterogeneity.
Journal of Econometrics 89 57–78.
Rossi, Peter E., Robert McCulloch, Greg M. Allenby. 1996. The Value of Purchase History
Data in Target Marketing. Marketing Science 15 321–340.
Rust, John. 1987. Optimal Replacement of GMC Bus Engines: An Empirical Model of
Harold Zurcher. Econometrica 55(5) 999–1033.
Rust, John. 1997. Using Randomization to Break the Curse of Dimensionality. Econo-
metrica 65(3) 487–516.
Silverman, Bernard W. 1986. Density Estimation for Statistics and Data Analysis. Chapman
and Hall, London, UK.
Song, Inseong, Pradeep K. Chintagunta. 2003. A Micromodel of New Product Adop-
tion with Heterogeneous and Forward Looking Consumers: Application to the Digital
Camera Category. Quantitative Marketing and Economics 1(4) 371–407.
Sun, Baohong. 2005. Promotion Effect on Endogenous Consumption. Marketing Science
24(3) 430–443.
Train, Kenneth E. 2003. Discrete Choice Methods with Simulation. Cambridge University
Press, Cambridge, UK.
Yang, Botao, Andrew Ching. 2008. Dynamics of Consumer Adoption Decisions of Finan-
cial Innovation: The Case of ATM Cards in Italy. Working paper, Rotman School of
Management, University of Toronto.
Table 1: Estimation Results: Homogeneous Model

                                           β = 0.6           β = 0.8
parameter                      TRUE     mean     sd       mean     sd
α1 (intercept for store 1)      0.0    -0.001   0.019    -0.030   0.022
α2 (intercept for store 2)      0.0    -0.002   0.019    -0.018   0.028
G1 (reward for store 1)         1.0     0.998   0.017     1.052   0.021
G2 (reward for store 2)         5.0     5.032   0.048     5.088   0.085
γ (price coefficient)          -1.0    -0.999   0.016    -0.996   0.019
β (discount factor)         0.6/0.8     0.601   0.008     0.800   0.010

Notes:
Sample size: 1,000 consumers for 100 periods.
Fixed parameters: S1 = 2, S2 = 4, p = 1.0, σp = 0.3, σGj = 0 for j = 1, 2.
Tuning parameters: N = 1,000 (number of past pseudo-value functions used for emax approximations), h = 0.01 (bandwidth).
Table 2: Estimation Results: Heterogeneous Model

                                           β = 0.6           β = 0.8
parameter                      TRUE     mean     sd       mean     sd
α1 (intercept for store 1)      0.0    -0.005   0.019    -0.022   0.022
α2 (intercept for store 2)      0.0     0.010   0.021     0.005   0.037
G1 (reward for store 1)         1.0     1.017   0.017     1.010   0.019
G2 (reward for store 2)         5.0     5.066   0.065     4.945   0.130
σG2 (sd of G2)                  1.0     1.034   0.046     1.029   0.040
γ (price coefficient)          -1.0    -1.004   0.016    -0.985   0.019
β (discount factor)         0.6/0.8     0.595   0.005     0.798   0.006

Notes:
Sample size: 1,000 consumers for 100 periods.
Fixed parameters: S1 = 2, S2 = 4, p = 1.0, σp = 0.3, σG1 = 0.
Tuning parameters: N = 1,000 (number of past pseudo-value functions used for emax approximations), h = 0.01 (bandwidth).
Table 3: Computation Time Per MCMC Iteration (in seconds)

                                    Homogeneous Model            Heterogeneous Model
algorithm                      β = 0.6  β = 0.8  β = 0.98    β = 0.6  β = 0.8  β = 0.98
Full solution based Bayesian    0.782    0.807    1.410      31.526   65.380   613.26
IJC with N = 1000               1.071    1.049    1.006      19.300   19.599    18.387

Notes:
Sample size: 1,000 consumers for 100 periods.
Number of state points: 8 (S1 = 2, S2 = 4).
Number of parameters: 6 in homogeneous model; 7 in heterogeneous model.
Table 4: The Impact of N

                                          N = 100           N = 1000
parameter                      TRUE     mean     sd       mean     sd
α1 (intercept for store 1)      0.0    -0.049   0.020    -0.061   0.020
α2 (intercept for store 2)      0.0     0.032   0.019     0.022   0.019
G1 (reward for store 1)         1.0     1.234   0.034     1.246   0.021
G2 (reward for store 2)        10.0     9.740   0.063     9.751   0.028
γ (price coefficient)          -1.0    -1.000   0.018    -0.991   0.018

Notes:
Sample size: 1,000 consumers for 100 periods.
Fixed parameters: S1 = 5, S2 = 10, p = 1.0, σp = 0.3, σGj = 0 for j = 1, 2, β = 0.98.
Tuning parameters: h = 0.01 (bandwidth).
Table 5: Breakdown of Computation Time Per MCMC Iteration (in seconds) for IJC Algorithm (β = 0.8)

Heterogeneous Model
computation                                        N = 100   N = 500   N = 1000
Emax approximation at θ*r (steps 4(b) & 5(b))      1.1502    5.6808    11.3149
Likelihood value at θ*r (steps 4(b) & 5(b))        0.5229    0.5174     0.5264
Emax approximation at θr (step 7)                  0.7180    3.5097     6.9819
Likelihood value at θr (step 7)                    0.3280    0.3223     0.3271
Computation time per iteration                     2.7724   10.1219    19.2253

Notes:
Sample size: 1,000 consumers for 100 periods.
Number of state points: 8 (S1 = 2, S2 = 4).
Number of parameters: 7.
Steps indicated in the table are those in Section 3.2.2.
Step 7 was performed every time a candidate parameter value was rejected.
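As the table notes indicate, the emax approximation in the IJC algorithm averages N past pseudo-value functions using kernel weights with bandwidth h, so its cost grows roughly linearly in N, consistent with the timings above. A minimal sketch of such a kernel-weighted average follows; the function name, array shapes, and the Gaussian kernel are our own illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def emax_approx(theta, past_thetas, past_values, h=0.01):
    """Kernel-weighted average of past pseudo-value functions.

    theta:       (K,)   current candidate parameter draw
    past_thetas: (N, K) the N most recent parameter draws
    past_values: (N, S) pseudo-value functions evaluated at S state points
    h:           kernel bandwidth
    """
    # Gaussian kernel weights based on squared distance to the current draw
    dist2 = np.sum((past_thetas - theta) ** 2, axis=1)
    w = np.exp(-dist2 / (2.0 * h ** 2))
    w /= w.sum()
    # Weighted average across the N past iterations, one value per state point
    return w @ past_values
```

Past draws close to the current candidate receive nearly all the weight, which is how information from earlier iterations is reused in place of a full contraction-mapping solve at each parameter value.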
Figure 1: Choice probabilities across states for different discount factors
[Line plot. x-axis: no. of stamps collected (s), 0–4; y-axis: purchase probability, 0–1; one curve per discount factor: β = 0, 0.5, 0.75, 0.9, 0.999.]
Figure 2: Choice probabilities for different discount factors across states
[Line plot. x-axis: discount factor (β), 0–1; y-axis: purchase probability, 0–1; one curve per state: s = 0, 1, 2, 3, 4.]
Figure 3: MCMC plots: Homogeneous Model with β = 0.8
[Six trace plots over 10,000 MCMC iterations, one per parameter: α1 (true value = 0.0), α2 (true value = 0.0), G1 (true value = 1.0), G2 (true value = 5.0), γ (true value = -1.0), β (true value = 0.8).]
Figure 4: MCMC plots: Heterogeneous Model with β = 0.8
[Seven trace plots over 10,000 MCMC iterations, one per parameter: α1 (true value = 0.0), α2 (true value = 0.0), G1 (true value = 1.0), G2 (true value = 5.0), γ (true value = -1.0), β (true value = 0.8), σG2 (true value = 1.0).]
Figure 5: MCMC plots: Impact of N on α1 and α2 when β = 0.98
[Trace plots over 30,000 MCMC iterations for α1 (true value = 0.0) and α2 (true value = 0.0), comparing N = 100 (left column) and N = 1000 (right column).]
Figure 6: MCMC plots: Impact of N on G1 and G2 when β = 0.98
[Trace plots over 30,000 MCMC iterations for G1 (true value = 1.0) and G2 (true value = 5.0), comparing N = 100 (left column) and N = 1000 (right column).]