QED
Queen’s Economics Department Working Paper No. 1201

A Practitioner’s Guide to Bayesian Estimation of Discrete Choice Dynamic Programming Models

Susumu Imai, Queen’s University
Andrew Ching, University of Toronto
Masakazu Ishihara, University of Toronto
Neelam Jain, Northern Illinois University

Department of Economics, Queen’s University
94 University Avenue, Kingston, Ontario, Canada K7L 3N6

4-2009
A Practitioner’s Guide to Bayesian Estimation of Discrete Choice Dynamic Programming Models*

Andrew Ching, Rotman School of Management, University of Toronto
Susumu Imai, Department of Economics, Queen’s University
Masakazu Ishihara, Rotman School of Management, University of Toronto
Neelam Jain, Department of Economics, Northern Illinois University

This draft: April 5, 2009

*This is work-in-progress. Comments are welcome.
Abstract

This paper provides a step-by-step guide to estimating discrete choice dynamic programming (DDP) models using the Bayesian Dynamic Programming algorithm developed in Imai, Jain and Ching (2008) (IJC). The IJC method combines the DDP solution algorithm with the Bayesian Markov Chain Monte Carlo algorithm into a single algorithm, which solves the DDP model and estimates its structural parameters simultaneously. The main computational advantage of this estimation algorithm is its efficient use of information obtained in past iterations. In the conventional Nested Fixed Point algorithm, most of the information obtained in past iterations remains unused in the current iteration. In contrast, the Bayesian Dynamic Programming algorithm extensively uses the computational results from past iterations to help solve the DDP model at the current parameter values. Consequently, it significantly alleviates the computational burden of estimating a DDP model. We carefully discuss how to implement the algorithm in practice, and use a simple dynamic store choice model to illustrate how to apply it to obtain parameter estimates.
1 Introduction
In economics and marketing, there is a growing empirical literature which studies the choices
of agents on both the demand and supply side, taking into account their forward-looking
behavior. A common framework for capturing the forward-looking behavior of consumers or
firms is the discrete choice dynamic programming (DDP) framework. This framework has been
applied to study managers' decisions to replace old equipment (e.g., Rust 1987),
career choice (e.g., Keane and Wolpin 1997; Diermeier, Merlo and Keane 2005),
the choice to commit crimes (Imai and Krishna 2004), dynamic brand choice (e.g., Erdem
and Keane 1996; Gonul and Srinivasan 1996), dynamic quantity choice with stockpiling
behavior (e.g., Erdem, Imai and Keane 2003; Sun 2005; Hendel and Nevo 2006), new
product/technology adoption decisions (Ackerberg 2003; Song and Chintagunta 2003;
Crawford and Shum 2005; Yang and Ching 2008), new product introduction decisions
(Hitsch 2006), etc. Although the framework provides a theoretically tractable way to
model forward-looking incentives, and this literature has been growing, it remains small
relative to the literature that models choice using a static reduced-form framework. This
is mainly due to two obstacles to estimating this class of models: (i) the curse of
dimensionality in the state space, which constrains the development of models that match
real-world applications; and (ii) the complexity of the likelihood/GMM objective function,
which makes it difficult to search for the global maximum/minimum when using a classical
approach to estimate them. Several studies have proposed different ways to approximate the
dynamic programming solutions and reduce the hurdle posed by the curse of dimensionality
(e.g., Keane and Wolpin 1994; Rust 1997; Hotz and Miller 1993; Aguirregabiria
and Mira 2002; Ackerberg 2001).[1] Nevertheless, little progress has been made in handling
the complexity of the likelihood function of DDP models. A typical approach is
to use different initial values to re-estimate the model, and check which set of parameter
estimates gives the highest likelihood value. However, without knowing the exact shape
of the likelihood function, it is often difficult to confirm whether the estimated parameter
vector indeed gives us the global maximum.

[1] Geweke and Keane (2002) proposed using a flexible polynomial to approximate the future component of the Bellman equation. Their approach allows them to conduct Bayesian inference on the structural parameters of the current payoff functions and the reduced-form parameters of the polynomial approximation. However, since it completely avoids solving for the DDP model and fully specifying it, their estimation results are not efficient and policy experiments cannot be conducted under their approach.

In the past two decades, the Bayesian Markov Chain Monte Carlo (MCMC) approach
has provided a tractable way to simulate the posterior distribution of parameter vectors
for complicated static discrete choice models, making the posterior mean an attractive
estimator compared with classical point estimates in that setting (Albert and Chib 1993;
McCulloch and Rossi 1994; Allenby and Lenk 1994; Allenby 1994; Rossi et al. 1996;
Allenby and Rossi 1999). However, researchers seldom use the Bayesian approach to
estimate DDP models. The main problem is that the Bayesian MCMC approach typically
requires many more iterations than the classical approach to achieve convergence. For each
simulated draw of the parameter vector, the DDP model needs to be solved to calculate
the likelihood function. As a result, the computational burden of solving a DDP model
has essentially ruled out the Bayesian approach except for very simple models, where the
model can be solved very quickly or a closed-form solution exists (e.g., Lancaster 1997).

Recently, Imai, Jain and Ching (2008) (IJC) propose a new modified MCMC algorithm
to reduce the computational burden of estimating infinite horizon DDP models using
the Bayesian approach. This method combines the DDP solution algorithm with the
Bayesian MCMC algorithm into a single algorithm, which solves the DDP model and
estimates its structural parameters simultaneously. In the conventional Nested Fixed
Point algorithm, most of the information obtained in the past iterations remains unused
in the current iteration. In contrast, the IJC algorithm extensively uses the computational
results obtained in past iterations to help solve the DDP model at the current
parameter values. This new method is potentially superior to prior methods
because (1) it significantly reduces the computational burden of solving for the DDP model
in each iteration, and (2) it produces the posterior distribution of parameter vectors, and
the corresponding solutions for the DDP model – this avoids the need to search for the
global maximum of a complicated likelihood function.
This paper provides a step-by-step guide to estimating discrete choice dynamic programming
(DDP) models using the IJC method. We carefully discuss how to implement
the algorithm in practice, and use a simple dynamic store choice model to illustrate how
to apply this algorithm to obtain parameter estimates. Our goal is to reduce the costs of
adopting this new method and expand the toolbox of researchers who are interested in
estimating DDP models. The rest of the paper is organized as follows. In section 2, we
present a dynamic store choice model, where each store offers its own reward program.
In section 3, we present the IJC method and explain how to implement it to obtain
parameter estimates for this model. We also discuss the practical aspects of using the IJC
method. In section 4, we use the estimation results of the dynamic store choice model to
demonstrate certain properties of the IJC estimation method. Section 5 discusses how
to extend the IJC algorithm to (i) conduct policy experiments, and (ii) allow for continuous
state variables. We also comment on the choice of kernels. Section 6 concludes.
2 The Model

2.1 The Basic Framework
Suppose that there are two supermarkets in a city (j = 1, 2). Each store offers a stamp
card, which can be exchanged for a gift upon completion. Consumers get one stamp for
each visit with a purchase.
Reward programs at the two supermarkets differ in terms of (i) the number of stamps
required for a gift (Sj), and (ii) the mean value of the gift (Gj). Consumers get a gift in
the same period (t) that they complete the stamp card. Once consumers receive a gift,
they will start with a blank stamp card again in the next period.
Let p_ijt be the price that consumer i pays at supermarket j at time t. We assume
that prices at store j in each period are drawn from an iid normal distribution, N(p, σ_p^2),
which is common across stores. We also assume that this price distribution is known to
consumers. Let s_ijt ∈ {0, 1, . . . , S_j − 1} denote the number of stamps collected for store
j in period t before consumers make a decision. Note that sijt does not take the value
Sj because of our assumption that consumers get a gift in the same period that they
complete the stamp card.
Consumer i’s single period utility of visiting supermarket j in period t at s_it =
(s_i1t, s_i2t) is given by

    U_ijt(s_it) = { α_j + γ p_ijt + G_ij + ε_ijt    if s_ijt = S_j − 1,
                  { α_j + γ p_ijt + ε_ijt           otherwise,
where αj is the consumer loyalty (or brand equity) for store j, γ is the price sensitivity,
Gij is consumer i’s valuation of gift for store j, and εijt is the idiosyncratic error term.
We assume ε_ijt is extreme-value distributed. G_ij is assumed to be normally distributed
around G_j with standard deviation σ_Gj. In each period, consumers may choose not
to go shopping. The single period mean utility of no shopping is normalized to zero, i.e.,
Ui0t = εi0t.
Consumer i’s objective is to choose a sequence of store choices to maximize the sum
of the present discounted future utility:
    max_{{d_ijt}_{t=1}^∞}  E[ Σ_{t=1}^∞ β^{t−1} d_ijt U_ijt(s_it) ],
where dijt = 1 if consumer i chooses store j in period t and dijt = 0 otherwise. β is the
discount factor. The evolution of state, sit, is deterministic and depends on consumers’
choice. Given the state sijt, the next period state, sijt+1, is determined as follows:
    s_ijt+1 = { s_ijt + 1   if s_ijt < S_j − 1 and the consumer purchases at store j in period t,
              { 0           if s_ijt = S_j − 1 and the consumer purchases at store j in period t,    (1)
              { s_ijt       if the consumer purchases at store −j or does not shop in period t.
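The transition rule (1) is simple to code. A minimal sketch in Python, for one store's stamp count (the function and argument names are our own, purely illustrative):

```python
def next_stamps(s_j, S_j, bought_at_j):
    """State transition (1) for one store's stamp count s_j in {0, ..., S_j - 1}."""
    if not bought_at_j:
        return s_j          # purchase at the other store or no shopping: unchanged
    if s_j < S_j - 1:
        return s_j + 1      # one more stamp for a purchase
    return 0                # card completed this period: start with a blank card
```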
Let θ be the vector of parameters. Also, define s_i = (s_i1, s_i2), p_i = (p_i1, p_i2), and
G_i = (G_i1, G_i2). In state s_i, the Bellman equation for consumer i is given by

    V(s_i, p_i; G_i, θ) ≡ E_ε max{ V_0(s_i; θ) + ε_i0, V_1(s_i, p_i1; G_i, θ) + ε_i1, V_2(s_i, p_i2; G_i, θ) + ε_i2 }
                        = log( exp(V_0(s_i; θ)) + exp(V_1(s_i, p_i1; G_i, θ)) + exp(V_2(s_i, p_i2; G_i, θ)) ),    (2)
where the second equality follows from the extreme value assumption on ε. The alternative-
specific value functions are written as: For j = 1, 2,
    V_j(s_i, p_ij; G_i, θ) = { α_j + γ p_ij + G_ij + β E_p′[V(s′, p′; G_i, θ)]   if s_ij = S_j − 1,
                             { α_j + γ p_ij + β E_p′[V(s′, p′; G_i, θ)]          otherwise,          (3)

    V_0(s_i; θ) = β E_p′[V(s′, p′; G_i, θ)],
where the state transition from si to s′ follows (1), and the expectation with respect to
p′ is defined as
    E_p′[V(s′, p′; G_i, θ)] = ∫ V(s′, p′; G_i, θ) dF(p′).
The parameters of the model are α_j (store loyalty), G_j (consumers' mean value of the
gift offered by store j), σ_Gj (standard deviation of G_ij), γ (price sensitivity), β (discount
factor), p (mean price common across stores), and σ_p (standard deviation of prices).
Hartmann and Viard (2008) estimated a dynamic model with reward programs that is
similar to the one here. The main differences are (1) we allow for two stores with different
reward programs in terms of (Gj, Sj) while they considered one store (golf club); (2) we
estimate the discount factor (i.e., β) while they fixed it according to the interest rate.
The general dynamics of this model are also more complicated than those of the model used in IJC for
Monte Carlo exercises. The model here has two endogenous state variables (s1, s2), while
the dynamic firm entry-exit decision model used in IJC has one exogenous state variable
(capital stock). However, IJC consider a normal error term, which is more general than
the extreme value error term we assume here. We consider the extreme value error term
because (1) it is quite common for researchers to adopt this distributional assumption when
estimating a DDP model, and (2) our analysis here complements that of IJC.
2.2 Identification
The main dynamics of the model is the intertemporal trade-off created by the reward
program. Suppose that a consumer is closer to the completion of the stamp card for
store 1, but the price is lower in store 2 today. If the consumer chooses store 2 based on
the lower price, he or she will delay the completion of the stamp card for store 1. If the
consumer takes the future into account, the delay will lower the present discounted value of
the reward. Thus he/she will have an incentive to keep shopping at store 1 even though
the price at store 2 is lower. Moreover, such an incentive should depend on the discount
factor.
This dynamic trade-off suggests that the empirical choice frequency of visiting the
stores across states should allow us to pin down the discount factor. To illustrate this
point, we consider a model of homogeneous consumers with only one store and an outside
option and simulate choice probabilities for different discount factors. In this exercise,
we set α1 = −2, γ = 0, G1 = 3, and S1 = 5. Figure 1 shows how the choice probability
of visiting the store changes across states (no. of stamps collected) for different discount
factors (β = 0, 0.5, 0.75, 0.9, 0.999). When β = 0, the choice probability is purely
determined by the single period utility. Thus, the choice probability is flat from s = 0 to
s = 3. At s = 4, consumers receive the gift, so the choice probability jumps up. Another
extreme case is when β is close to one (β = 0.999). Interestingly, the choice probabilities
are essentially constant across states, and higher than those of β = 0 for s = 0, ..., 3. But
at s = 4, the choice probability for β = 0.999 is smaller than that for β = 0.
Figure 2 shows how the choice probabilities at each state change with β. In general, as
we increase β, we observe three predictions: (i) when s = 4, the store choice probability
monotonically decreases; (ii) when s < 3, the store choice probability monotonically
increases; (iii) when s = 3, the store choice probability first increases and then decreases.
Why do we have prediction (i)? First, we note that EpV (s) always increases with β for
all s, but the extent of the increase would differ across s. In particular, EV (s = 4) is
much more responsive than EV (s = 0) to an increase in β. This is because a consumer
is much closer to getting a reward at s = 4. This shows that when β increases, the value
of choosing the outside option (depending on EV (s = 4)) will increase faster than that
of choosing to visit the store (depending on EV (s = 0)). As a result, the store choice
probability decreases with β. Prediction (ii) is mainly due to the fact that the incentive to
earn a reward increases with β. To understand prediction (iii), we note that in general an
increase in β has two effects on store choice. On the one hand, it increases the incentive
to earn a reward as the consumer puts more weight in the future. On the other hand, a
larger β also implies that the consumer is more patient and, consequently, less
concerned about waiting (i.e., the costs of waiting become smaller). The interaction between
these two forces explains prediction (iii). When β first increases from zero, the former
force (i.e., the incentive to earn a reward earlier) dominates. But as β gets closer to
1, the latter force (i.e., the patience factor) catches up. Lastly, it should be noted that the
choice probabilities become flat when β is very close to one. This is because the difference
between these two forces essentially becomes constant across s, when β approaches one.
The above discussion also suggests that the identification of α_j and G_j crucially depends
on the value of β. When β is close to 1, it is very hard to identify them separately
because they play very similar roles in determining the choice probabilities. But when β
moves away from 1, it becomes much easier to identify them.[2]

[2] Hartmann and Viard (2008) also discussed how the discount factor would affect the pattern of choice probabilities. However, because they take the intrinsic discount factor as exogenously given (determined by the interest rate), they argue that such an effect would operate through the “artificial” discount factor, which depends on how frequently a customer visits a store (determined by α_j here).
2.3 Numerical Solution of the Model
We first consider how to solve the model without consumer heterogeneity in G_ij; solving
the model with consumer heterogeneity is a straightforward extension. To solve the
model without consumer heterogeneity (i.e., G_ij = G_j for all i and j = 1, 2) numerically,
we:

1. Make M draws {p_j^m}_{m=1}^M from the price distribution N(p, σ_p^2) for j = 1, 2. These
   draws are held fixed throughout.

2. Start with an initial guess of the Emax functions, e.g., setting V^0(s, p; θ) = 0, ∀s, p.
   Suppose that we know V^l, where l is the number of past iterations. We discuss
   how to obtain V^{l+1}.

3. Substitute {p^m}_{m=1}^M into V^l(s, p; θ), and take the average across the p^m's to obtain
   a Monte Carlo approximation of E_p′ V^l(s′, p′; θ), ∀s′.

4. Substitute these approximated expected future value functions into the Bellman
   operator to obtain V^{l+1}(s, p; θ), ∀s, p.

5. Repeat steps 3-4 until E_p V^{l+1}(·, p; θ) converges.

For the model with consumer heterogeneity in G_ij, we would need to compute V^{l+1}(s, p; G_i, θ)
for each i.
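As a concrete illustration, the five steps above can be sketched in Python for the homogeneous case. The parameter values, array layout, and function name below are our own illustrative choices, not the paper's estimates:

```python
import numpy as np

# Illustrative parameter values (our own choices, not estimates from the paper)
alpha = [-2.0, -2.0]   # store loyalties alpha_1, alpha_2
gamma = -1.0           # price sensitivity
G     = [3.0, 3.0]     # gift values G_1, G_2 (homogeneous consumers)
S     = [5, 5]         # stamps required S_1, S_2
beta  = 0.9            # discount factor
p_mean, sigma_p, M = 1.0, 0.2, 100

rng = np.random.default_rng(0)
prices = rng.normal(p_mean, sigma_p, size=(M, 2))  # step 1: fixed price draws

def solve_emax(tol=1e-8):
    """Value iteration on E_p V(s, p; theta) over the stamp-count grid."""
    EV = np.zeros((S[0], S[1]))                    # step 2: initial guess
    while True:
        new_EV = np.empty_like(EV)
        for s1 in range(S[0]):
            for s2 in range(S[1]):
                n1, n2 = (s1 + 1) % S[0], (s2 + 1) % S[1]   # transition (1)
                v0 = beta * EV[s1, s2]                      # no shopping
                v1 = (alpha[0] + gamma * prices[:, 0]
                      + (G[0] if s1 == S[0] - 1 else 0.0)
                      + beta * EV[n1, s2])
                v2 = (alpha[1] + gamma * prices[:, 1]
                      + (G[1] if s2 == S[1] - 1 else 0.0)
                      + beta * EV[s1, n2])
                # logsum over alternatives (extreme value errors), then
                # average over the price draws (steps 3-4)
                logsum = np.log(np.exp(v0) + np.exp(v1) + np.exp(v2))
                new_EV[s1, s2] = logsum.mean()
        if np.max(np.abs(new_EV - EV)) < tol:      # step 5: convergence
            return new_EV
        EV = new_EV
```

Since β < 1 the Bellman operator is a contraction, so the loop converges for any initial guess.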
3 Estimation Method

3.1 The IJC algorithm
It is well known that when using maximum likelihood or Bayesian MCMC to estimate
discrete choice dynamic programming models, the main computational burden is that
the value functions need to be solved for each trial parameter vector (for maximum
likelihood) or each random draw of the parameter vector (for Bayesian MCMC). Since
both procedures, in particular Bayesian MCMC, require many iterations to achieve
convergence, a typical nested fixed point algorithm needs to repeatedly apply the
iteration procedure outlined above to solve for the value functions.[3] As a result, the
computational burden is so large that even a very simple discrete choice dynamic
programming model cannot be estimated using standard Bayesian MCMC methods.

[3] It should be noted that the Bayesian MCMC algorithm generally needs to be run 10,000 to 30,000 iterations to obtain enough draws for the posterior distributions.

The IJC algorithm relies on two insights to reduce the computational burden of each
MCMC iteration. (1) It could be quite wasteful to compute the value function exactly
before the Markov chain converges to the true posterior distribution. Therefore, the IJC
algorithm proposes to “partially” solve the Bellman equation for each parameter draw
(at the minimum, applying the Bellman operator only once in each iteration). (2) The value
functions evaluated at past MCMC draws of the parameters contain useful information
about the value functions at the current draw, in particular those
evaluated within a neighborhood of the current parameter values. However, the traditional
nested fixed point algorithm hardly makes use of them. Therefore, IJC propose to replace
the contraction mapping procedure for solving the value functions with a weighted average
of the pseudo-value functions obtained as past outcomes of the estimation algorithm. In
IJC, the weight depends on the distance between the past parameter vector draw and
the current one – the shorter the distance, the higher the weight. The basic intuition is
that the value function is continuous in the parameter space. Therefore, it is possible
to use the past value functions to form a non-parametric estimate of the value function
evaluated at the current draw of parameter values. Such a non-parametric estimate is
usually computationally much cheaper than the standard contraction mapping procedure,
in particular for β close to 1. Combining these two insights, IJC dramatically reduce the
computational burden of each iteration in the Bayesian MCMC algorithm. This modified
procedure differs from the standard nested fixed point algorithm in an important aspect:
instead of solving the model and searching for parameters alternately, it solves and estimates
the model simultaneously.
In the context of the reward program example without consumer heterogeneity, the
outputs of the algorithm in each iteration r include {θ^r, E_p V^r(s, p; θ^r)}, where V^r is the
pseudo-value function. To obtain these outputs, IJC make use of the past outcomes of
the estimation algorithm, H^r = {θ^l, E_p V^l(s, p; θ^l)}_{l=r−N}^{r−1}. We now extend the Bellman
equations (2)-(3) to illustrate the pseudo-value functions defined in IJC.
The pseudo-value functions are defined as follows. To simplify notation, we drop the
i subscript on s and p. For each s,

    V^r(s, p; θ^r) = log( exp(V_0^r(s; θ^r)) + exp(V_1^r(s, p_1; θ^r)) + exp(V_2^r(s, p_2; θ^r)) ),    (4)

where, for j = 1, 2,

    V_j^r(s, p_j; θ^r) = { α_j + γ p_j + G_j + β E_p′^r V(s′, p′; θ^r)   if s_j = S_j − 1,
                         { α_j + γ p_j + β E_p′^r V(s′, p′; θ^r)          otherwise,                    (5)

    V_0^r(s; θ^r) = β E_p′^r V(s′, p′; θ^r).
The approximated Emax functions are defined as a weighted average of the past
pseudo-value functions obtained from the estimation algorithm. For instance, E_p^r V(s, p; θ^r)
can be constructed as follows:

    E_p^r V(s, p; θ^r) = Σ_{l=r−N}^{r−1} E_p V^l(s, p; θ^l) · K_h(θ^r − θ^l) / Σ_{k=r−N}^{r−1} K_h(θ^r − θ^k),    (6)

where K_h(·) is a Gaussian kernel with bandwidth h. To obtain E_p V^r(s, p; θ^r), we simulate
a set of prices, {p^m}_{m=1}^M, from the price distribution N(p, σ_p^2), evaluate V^r(s, p^m; θ^r)
using (4) and (5), and average across the draws:

    E_p V^r(s, p; θ^r) = (1/M) Σ_{m=1}^M V^r(s, p^m; θ^r).    (7)
IJC show that, by treating the pseudo-value function as the true value function in an
MCMC algorithm and applying it to estimate a dynamic discrete choice problem with a
discrete state space, the modified MCMC draws of θ^r converge to the true posterior
distribution uniformly.
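The kernel-weighted average in (6) is straightforward to code. A minimal sketch with a Gaussian product kernel (the array shapes and function name are our own assumptions):

```python
import numpy as np

def emax_approx(theta_r, past_thetas, past_emax, h=0.01):
    """Kernel-weighted average (6) of stored expected pseudo-value functions.

    past_thetas: (N, K) array of past parameter draws theta^l
    past_emax:   (N, n_states) array of stored E_p V^l(s, p; theta^l)
    Returns an (n_states,) approximation of E_p^r V(s, p; theta_r).
    """
    diff = (np.asarray(past_thetas) - theta_r) / h
    # Gaussian product kernel K_h(theta_r - theta^l); normalizing constants
    # cancel in the weight ratio
    log_k = -0.5 * np.sum(diff * diff, axis=1)
    w = np.exp(log_k - log_k.max())      # stabilize before normalizing
    w /= w.sum()
    return w @ np.asarray(past_emax)     # weighted average over past draws
```

Subtracting the maximum before exponentiating keeps the weights numerically stable when the kernel distances are large.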
It should be highlighted that the approximated Emax function defined in (6) is the key
innovation of IJC. In principle, this step is also applicable in classical estimation methods
such as GMM and maximum likelihood.[4] However, there are several advantages to
implementing IJC's pseudo-value function approach in Bayesian estimation. First, the
non-parametric approximation in (6) is more efficient if the past pseudo-value functions
are evaluated at θ^l's that are randomly distributed around θ^r. This is naturally
achieved by the Bayesian MCMC algorithm. In contrast, classical estimation
methods typically require minimizing/maximizing an objective function. Commonly used
minimization/maximization routines (e.g., BHHH, quasi-Newton methods) tend to
search over the parameter space along a particular path. Consequently, we believe that the
approximation step proposed by IJC should perform better under the Bayesian MCMC
approach.[5] Second, in the presence of unobserved consumer heterogeneity, it is common
for the likelihood function to be multi-modal even in static choice problems. In this
situation, Bayesian posterior means are often better estimators of the true parameter
values than classical point estimates, because in practice, accurately simulating
a posterior is in many cases much easier than finding the global maximum/minimum
of a complex likelihood/GMM objective function.

[4] Brown and Flinn (2006) extend this key step to estimating a dynamic model of marital status choice and investment in children using the method of simulated moments.
We now turn to how to implement the IJC method to estimate the dynamic store
choice model presented here. We consider two versions of the model: (i) without
unobserved consumer heterogeneity, and (ii) with unobserved consumer heterogeneity.
3.2 Implementation of the IJC algorithm
In this subsection, we discuss how to estimate the dynamic store choice model using the
IJC algorithm. The steps are similar to the standard Metropolis-Hastings algorithm,
except that we use (6) to calculate the Emax functions. Let I_buy,ijt be an indicator for
purchasing at store j by consumer i at time t, and let p_it = (p_i1t, p_i2t) be the price vector
facing consumer i at time t. We use (I^d_buy,ijt, p^d_ijt) to denote the observed data.

[5] A stochastic optimization algorithm, simulated annealing, has recently gained some attention for handling complicated objective functions. This algorithm is an adaptation of the Metropolis-Hastings algorithm (Kirkpatrick et al. 1983; Cerny 1985). The IJC method should also be well suited to being combined with simulated annealing in the classical estimation approach. However, before starting the estimation, this method requires the researcher to choose a “cooling” rate, and the ideal cooling rate cannot be determined a priori. In the MCMC-based Bayesian algorithm, one does not need to deal with this nuisance parameter.
Our focus is to provide a step-by-step implementation guide and discuss the practical
aspects of the IJC algorithm. Once readers understand the implementation of the
IJC algorithm in this simple example, they should be able to extend it to other more
complicated settings.
3.2.1 Homogeneous Consumers
We first present the implementation of the IJC algorithm when consumers are homogeneous
in their valuations of G_j (i.e., σ_Gj = 0 for j = 1, 2). The vector of parameters to
be estimated is θ = (α_1, α_2, G_1, G_2, γ, β).

The IJC algorithm generates a sequence of MCMC draws θ^r, r = 1, 2, .... The
algorithm modifies the Metropolis-Hastings algorithm, and involves drawing a candidate
parameter vector θ^{*r} in each iteration. For each iteration (r):
1. Start with a history of past outcomes of the algorithm,

       H^r = { {θ^{*l}, E_p V^l(·, p; θ^{*l})}_{l=r−N}^{r−1}, ρ^r(θ^{r−1}) },

   where N is the number of past iterations used for the Emax approximation; E_p V^l is
   the expected pseudo-value function with respect to p; and ρ^r(θ^{r−1}) is the
   pseudo-likelihood of the data conditional on E_p^r V(·, p; θ^{r−1}).

2. Draw a candidate parameter vector θ^{*r} from a proposal distribution q(θ^{r−1}, θ^{*r}).

3. Compute the pseudo-likelihood conditional on θ^{*r} based on the approximated Emax
   functions. Then determine whether or not to accept θ^{*r} based on the acceptance
   probability,

       min( [π(θ^{*r}) · ρ^r(θ^{*r}) · q(θ^{*r}, θ^{r−1})] / [π(θ^{r−1}) · ρ^r(θ^{r−1}) · q(θ^{r−1}, θ^{*r})], 1 ).

   Essentially, we apply the Metropolis-Hastings algorithm here, treating the
   pseudo-likelihood as the true likelihood. If we accept, set θ^r = θ^{*r}; otherwise, set
   θ^r = θ^{r−1}. To obtain ρ^r(θ^{*r}), we need to calculate E_p^r V(·, p; θ^{*r}), which is
   obtained as the weighted average of the past expected pseudo-value functions,
   {E_p V^l(·, p; θ^{*l})}_{l=r−N}^{r−1}, with the weight of each past expected pseudo-value
   function determined by Gaussian kernels:

       E_p^r V(s, p; θ^{*r}) = Σ_{l=r−N}^{r−1} E_p V^l(s, p; θ^{*l}) · K_h(θ^{*r} − θ^{*l}) / Σ_{k=r−N}^{r−1} K_h(θ^{*r} − θ^{*k}).

4. Compute the pseudo-value function, E_p V^r(s, p; θ^{*r}):

   (a) Make M draws of prices, {p^m}_{m=1}^M, from the price distribution.

   (b) Compute V_0^r(s; θ^{*r}), V_1^r(s, p_1^m; θ^{*r}) and V_2^r(s, p_2^m; θ^{*r}), using the
       approximated Emax functions computed in step 3.

   (c) Given V_0^r(s; θ^{*r}), V_1^r(s, p_1^m; θ^{*r}) and V_2^r(s, p_2^m; θ^{*r}), obtain the
       pseudo-value function V^r(s, p^m; θ^{*r}) (see (4)). By averaging V^r(s, p^m; θ^{*r})
       across the p^m's, we integrate out prices and obtain E_p V^r(s, p; θ^{*r}).

5. Compute the pseudo-likelihood, ρ^{r+1}(θ^r), based on the updated set of past
   pseudo-value functions.[6]

6. Go to iteration r + 1.
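The loop above can be sketched as a compact Python skeleton. The model-specific pieces are passed in as callables, and everything here (names, signatures, the toy random-walk proposal) is our own illustration rather than IJC's reference implementation:

```python
import numpy as np

def kernel_weights(theta, past_thetas, h):
    """Gaussian kernel weights K_h(theta - theta^l), normalized to sum to 1."""
    d = (np.asarray(past_thetas) - theta) / h
    logk = -0.5 * np.sum(d * d, axis=1)
    w = np.exp(logk - logk.max())
    return w / w.sum()

def ijc_mcmc(theta0, n_iter, N, h, step, log_prior, pseudo_loglik,
             bellman_once, n_states, seed=0):
    """Skeleton of the IJC loop (steps 1-6), homogeneous-consumer case.

    pseudo_loglik(theta, emax) returns the pseudo log-likelihood given an
    approximated Emax vector; bellman_once(theta, emax) applies the Bellman
    operator once (equations (4)-(5) and (7)) and returns a new Emax vector.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    hist_theta = [theta.copy()]              # stored candidate draws theta^{*l}
    hist_emax = [np.zeros(n_states)]         # initial Emax guess
    log_rho = pseudo_loglik(theta, hist_emax[0])
    draws = []
    for r in range(n_iter):
        # Step 2: random-walk candidate draw
        cand = theta + step * rng.standard_normal(theta.size)
        # Step 3: kernel-weighted Emax approximation at the candidate (eq. (6))
        w = kernel_weights(cand, hist_theta[-N:], h)
        emax_c = w @ np.asarray(hist_emax[-N:])
        log_rho_c = pseudo_loglik(cand, emax_c)
        # M-H acceptance, treating the pseudo-likelihood as the true likelihood
        if np.log(rng.uniform()) < (log_prior(cand) + log_rho_c
                                    - log_prior(theta) - log_rho):
            theta, log_rho = cand, log_rho_c
        # Step 4: one Bellman-operator application at the candidate; we store
        # candidates (not accepted draws) so stored points span the space
        hist_theta.append(cand)
        hist_emax.append(bellman_once(cand, emax_c))
        draws.append(theta.copy())
    return np.array(draws)
```

Plugging in the store choice model amounts to supplying `pseudo_loglik` built from the choice probabilities and `bellman_once` built from (4), (5) and (7).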
Note that when implementing the IJC algorithm, we propose to store
{θ^{*l}, E_p V^l(s, p; θ^{*l})}_{l=r−N}^{r−1} instead of {θ^l, E_p V^l(s, p; θ^l)}_{l=r−N}^{r−1}. If we
store {θ^l, E_p V^l(s, p; θ^l)}_{l=r−N}^{r−1}, a significant portion of the θ^l's may be repeated,
because the acceptance rate of the M-H step is usually set at around 1/3. To conduct the
non-parametric approximation of the expected future value, it is useful to have a set of
E_p V^l's that spans the parameter space. Since the θ^{*l}'s are drawn from a
candidate-generating function, it is much easier to achieve this goal by storing the
E_p V^l's at the θ^{*l}'s. Moreover, for each candidate parameter draw θ^{*r}, we need to
evaluate the expected future payoffs, E_p^r V(·, p; θ^{*r}), to form the likelihood. As we have
shown above, it only takes an extra step to obtain E_p V^r(·, p; θ^{*r}), so storing
{E_p V^l(·, p; θ^{*l})}_{l=r−N}^{r−1} imposes little extra cost.

[6] Two points should be noted here. First, in each iteration, there is a new pseudo-value function evaluated at θ^{*r}. Second, this step need not be carried out in a full-solution-based Bayesian MCMC algorithm.
The above description of the implementation is slightly different from IJC, who propose
to draw only one unobserved error term in each iteration. In the current application,
that would be equivalent to drawing one price shock p^l in each iteration and storing
H^r = { {θ^{*l}, V^l(·, p^l; θ^{*l})}_{l=r−N}^{r−1}, ρ^r(θ^{r−1}) }. The main difference between
these two approaches is that the original IJC propose to store V^l instead of E_p V^l, so
that the Emax approximation is

    E_p^r V(s, p; θ^{*r}) = Σ_{l=r−N}^{r−1} V^l(s, p^l; θ^{*l}) · K_h(θ^{*r} − θ^{*l}) / Σ_{k=r−N}^{r−1} K_h(θ^{*r} − θ^{*k}).

The advantage of the original approach is that it saves some time in computing E_p V^r.
However, we believe that the approach proposed here achieves the same level of precision
in integrating out prices with a smaller N. As a result, it also saves time in computing
the weighted average.
We should also note that, in the present example where prices are observed, one can use
the observed prices as random realizations in computing E_p V^r(s, p; θ^r), provided that
there is a sufficient number of observations for each s. The advantage of using the
observed prices is that the pseudo-value functions then become by-products of the
likelihood computation. However, this is only possible when we have a reasonably large
number of observations for each state.
We also note that, in step 5, it may not be worthwhile to compute the pseudo-likelihood
ρ^{r+1}(θ^r) in each iteration. If we accept θ^{*r}, then ρ^r(θ^{*r}), which is calculated in the
M-H step (i.e., step 3), can be used as a proxy for ρ^{r+1}(θ^r), which is computed in step 5;
note that their calculations differ only in one past pseudo-value function. If we reject θ^{*r},
we set θ^r = θ^{r−1}, so we can use ρ^r(θ^{r−1}) as a proxy for ρ^{r+1}(θ^r) and only compute
ρ^{r+1}(θ^r) once every several successive rejections. Our experience suggests that this
approach yields a fairly decent saving in computation time.
3.2.2 Heterogeneous Consumers
We now present the implementation of the IJC algorithm when consumers have heterogeneous
valuations of the reward (σ_Gj > 0). The vector of parameters to be estimated is
θ = (θ_1, θ_2), where θ_1 = (α_1, α_2, γ, β) and θ_2 = (G_1, G_2, σ_G1, σ_G2). We incorporate
the Bayesian approach for random-coefficient models into the estimation steps for the
homogeneous case. We use a normal prior on G_j and an inverted gamma prior on σ_Gj.

We use the Metropolis-Hastings algorithm to draw G_ij, consumer i's valuation of the
reward at store j, from the population distribution of G_ij. This draw is regarded as an
individual-specific parameter. As a result, conditional on G_i, the value functions do not
depend on θ_2.
Each MCMC iteration mainly consists of three blocks:

(i) Draw G_j^r ∼ f_G(G_j | σ_Gj^{r−1}, {G_ij^{r−1}}_{i=1}^I) and σ_Gj^r ∼ f_σG(σ_Gj | G_j^r, {G_ij^{r−1}}_{i=1}^I)
    for j = 1, 2 (the parameters that characterize the population distribution of G_ij),
    where f_G and f_σG are the posterior distributions.

(ii) Draw the individual parameters G_ij^r by the independent Metropolis-Hastings algorithm,
     with N(G_j^r, (σ_Gj^r)^2) as the proposal distribution, for all i and j = 1, 2.

(iii) Draw θ_1^r ∼ f_θ1(· | G_i^r) using the random-walk Metropolis-Hastings algorithm.
The estimation steps are as follows. Steps 2-3 belong to block (i), step 4 belongs to block (ii), and step 5 belongs to block (iii). For each MCMC iteration $r$:
1. Start with

$$H^r = \left\{\{\theta^{*l}, G^l_{i^*}, E_p V^l(\cdot, p; G^l_{i^*}, \theta^{*l}_1)\}_{l=r-N}^{r-1},\ \{\rho^r_i(G^{r-1}_i, \theta^{r-1}_1)\}_{i=1}^I\right\},$$

where $I$ is the number of consumers; $N$ is the number of past iterations used for the Emax approximation; and $i^* = r - I \cdot \mathrm{int}\!\left(\frac{r-1}{I}\right)$, where $\mathrm{int}(\cdot)$ is the integer function that converts any real number to an integer by discarding its value after the decimal place.
2. Draw $G^r_j$ (the population mean of $G_{ij}$) from the posterior density (normal) computed using $\sigma^{r-1}_{G_j}$ and $\{G^{r-1}_{ij}\}_{i=1}^I$.

3. Draw $\sigma^r_{G_j}$ (the population variance of $G_{ij}$) from the posterior density (inverted gamma) computed using $G^r_j$ and $\{G^{r-1}_{ij}\}_{i=1}^I$.
4. For each $i$, use the Metropolis-Hastings algorithm to draw $G^r_{ij}$.

(a) Use the prior on $G_{ij}$ (i.e., $N(G^r_j, (\sigma^r_{G_j})^2)$) as the proposal distribution to draw $G^{*r}_{ij}$.
(b) Compute the likelihood for consumer $i$ at $G^{*r}_{ij}$, i.e., $\rho^r_i(G^{*r}_{ij}, \theta^{r-1}_1)$, and determine whether or not to accept $G^{*r}_{ij}$. The acceptance probability, $\lambda$, is given by

$$\lambda = \min\left(\frac{\pi(G^{*r}_{ij}) \cdot \rho^r_i(G^{*r}_{ij}, \theta^{r-1}_1) \cdot q(G^{*r}_{ij}, G^{r-1}_{ij})}{\pi(G^{r-1}_{ij}) \cdot \rho^r_i(G^{r-1}_{ij}, \theta^{r-1}_1) \cdot q(G^{r-1}_{ij}, G^{*r}_{ij})},\ 1\right) = \min\left(\frac{\rho^r_i(G^{*r}_{ij}, \theta^{r-1}_1)}{\rho^r_i(G^{r-1}_{ij}, \theta^{r-1}_1)},\ 1\right),$$

where the second equality follows from the fact that the proposal distribution is set to the prior on $G_{ij}$, i.e., $q(x, y) = \pi(y)$. Let $G^r_i$ be the accepted draw. Note that $G^{*r}_{ij}$ will only affect $E^r_p V(s, p; G^{*r}_i, \theta^{r-1}_1)$ for consumer $i$ and not for other consumers. Thus we only need to approximate the Emax functions for consumer $i$, using the weighted average of $\{E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\}_{l=r-N}^{r-1}$ and treating $G_i$ as one of the parameters when computing the weights. In the case of independent kernels, the Emax approximation is

$$E^r_p V(s, p; G^{*r}_i, \theta^{r-1}_1) = \sum_{l=r-N}^{r-1} E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\ \frac{K_h(\theta^{r-1}_1 - \theta^{*l}_1)\, K_h(G^{*r}_i - G^l_{i^*})}{\sum_{k=r-N}^{r-1} K_h(\theta^{r-1}_1 - \theta^{*k}_1)\, K_h(G^{*r}_i - G^k_{i^*})}.$$

(Note that $\{K_h(\theta^{r-1}_1 - \theta^{*l}_1)\}_{l=r-N}^{r-1}$ is common across consumers; therefore, one can calculate it outside the consumer loop when programming this part.)

(c) Repeat for all $i$.

5. Use the Metropolis-Hastings algorithm to draw $\theta^r_1 = (\alpha^r_1, \alpha^r_2, \gamma^r, \beta^r)$ conditional on $G^r_{ij}$.

(a) Draw a candidate parameter vector, $\theta^{*r}_1 = (\alpha^{*r}_1, \alpha^{*r}_2, \gamma^{*r}, \beta^{*r})$.

(b) Compute the likelihood conditional on $(\alpha^{*r}_1, \alpha^{*r}_2, \gamma^{*r}, \beta^{*r})$ and $\{G^r_i\}_{i=1}^I$, based on the approximated Emax functions, and determine whether or not to accept $\alpha^{*r}_1$, $\alpha^{*r}_2$, $\gamma^{*r}$, and $\beta^{*r}$. The Emax approximation is described as follows.
For each $i$ and $s$, $E^r_p V(s, p; G^r_i, \theta^{*r}_1)$ is obtained using the weighted average of the past value functions, $\{E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\}_{l=r-N}^{r-1}$. In computing the weights for the past value functions, we treat $G_i$ as a parameter. In the case of independent kernels, equation (6) becomes

$$E^r_p V(s, p; G^r_i, \theta^{*r}_1) = \sum_{l=r-N}^{r-1} E_p V^l(s, p; G^l_{i^*}, \theta^{*l}_1)\ \frac{K_h(\theta^{*r}_1 - \theta^{*l}_1)\, K_h(G^r_i - G^l_{i^*})}{\sum_{k=r-N}^{r-1} K_h(\theta^{*r}_1 - \theta^{*k}_1)\, K_h(G^r_i - G^k_{i^*})}.$$

(Note that $\{K_h(\theta^{*r}_1 - \theta^{*l}_1)\}_{l=r-N}^{r-1}$ is common across consumers; therefore, one can compute it outside the loop that indexes consumers to save computational time.)
6. Computation of the pseudo-value function, $E_p V^r(s, p; G^r_{i^*}, \theta^{*r}_1)$:

(a) Make $M$ draws of prices, $p^m$, from the price distribution.

(b) Compute $V^r_0(s; \theta^{*r}_1)$, $V^r_1(s, p^m_1; G^r_{i^*}, \theta^{*r}_1)$ and $V^r_2(s, p^m_2; G^r_{i^*}, \theta^{*r}_1)$, using the approximated Emax functions computed in step 5(c).

(c) Given $V^r_0(s; \theta^{*r}_1)$, $V^r_1(s, p^m_1; G^r_{i^*}, \theta^{*r}_1)$ and $V^r_2(s, p^m_2; G^r_{i^*}, \theta^{*r}_1)$, obtain the pseudo-value function, $V^r(s, p^m; G^r_{i^*}, \theta^{*r}_1)$. By averaging $V^r(s, p^m; G^r_{i^*}, \theta^{*r}_1)$ across the $p^m$'s, we integrate out prices and obtain $E_p V^r(s, p; G^r_{i^*}, \theta^{*r}_1)$.

7. Compute the pseudo-likelihood, $\rho^{r+1}_i(G^r_i, \theta^r_1)$ for all $i$, based on an updated set of past pseudo-value functions.

8. Go to iteration $r + 1$.
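As a rough sketch of the approximation used in steps 4(b) and 5(b), the kernel-weighted average over stored pseudo-value functions might look as follows. This is a hypothetical container layout with a Gaussian product kernel, not the authors' implementation:

```python
import math

def gaussian_kernel(diffs, h):
    """Product Gaussian kernel K_h over a vector of componentwise differences."""
    return math.exp(-0.5 * sum((d / h) ** 2 for d in diffs))

def approx_emax(s, theta1, g_i, history, h):
    """Kernel-weighted average of past pseudo-value functions, treating the
    individual draw g_i as one of the parameters when computing the weights.
    `history` holds (theta1_l, g_l, emax_l) tuples, with emax_l mapping states
    to stored Emax values (a hypothetical layout)."""
    num = den = 0.0
    for theta1_l, g_l, emax_l in history:
        w = gaussian_kernel([a - b for a, b in zip(theta1, theta1_l)], h) \
            * gaussian_kernel([g_i - g_l], h)
        num += w * emax_l[s]
        den += w
    return num / den
```

Past iterations whose parameter and heterogeneity draws lie close to the evaluation point dominate the weighted average, which is the essence of the IJC approximation.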
In step 1 of the procedure described above, we pick one consumer in each iteration and store his/her pseudo-value function. Then, we use this pooled set of past pseudo-value functions across consumers to approximate the Emax functions for all consumers. This is slightly different from IJC, who originally propose to store an individual-specific set of past pseudo-value functions for each consumer. That is, in each iteration we store $H^r = \{\theta^{*l}, \{G^l_i, E_p V^l(\cdot, p; G^l_i, \theta^{*l})\}_{i=1}^I\}_{l=r-N}^{r-1}$, and use $\{E_p V^l(\cdot, p; G^l_i, \theta^{*l})\}_{l=r-N}^{r-1}$ to approximate consumer $i$'s Emax function. One advantage of this approach is that the past pseudo-value functions used in the Emax function approximation are more relevant to each consumer $i$, because they are evaluated at the $G^l_i$'s, which represent the posterior distribution of consumer $i$'s value for the gift, and should be closer to $G^{*r}_i$. This is not the case when we pool past pseudo-value functions across consumers, because different consumers may have very different values of $G_i$. This suggests that if we store past pseudo-value functions individually, we may be able to reduce $N$ while achieving the same level of precision for the Emax approximation, which in turn should reduce the computation time. One drawback is that we need much more memory to store past pseudo-value functions individually, although this may not be an important concern given that the price of computer memory has been decreasing rapidly over time.
We describe steps 2 and 3 only briefly here; for the details of these two steps, we refer readers to Train (2003). When implementing step 5, it could be more efficient to separate the parameters into blocks if the acceptance rate is low. The trade-off is that implementing this step in blocks may also increase the number of expected future value approximations and likelihood evaluations.
3.3 Choice of kernel’s bandwidth and N
The IJC method relies on classical non-parametric methods to approximate the Emax functions using the past pseudo-value functions generated by the algorithm. One practical problem of nonparametric regression analysis is that the data become increasingly sparse as the dimensionality of the explanatory variables ($x$) increases. For instance, ten points that are uniformly distributed in the unit cube are more scattered than ten points distributed uniformly in the unit interval. Thus the number of observations available to provide information about the local behavior of an arbitrary regression function becomes small when the dimension is large. The curse of dimensionality of this non-parametric technique (in terms of the number of parameters) could therefore be a concern. The root of this problem is the bias-variance trade-off. In general, when the kernel bandwidth is small, the effective number of sample points around $x$ that influence the prediction is small, making the prediction highly sensitive to that particular sample, i.e., yielding high variance. When the kernel bandwidth is large, the prediction becomes overly smooth, i.e., yielding high bias.
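The trade-off is easy to see in a toy Nadaraya-Watson kernel regression (a generic illustration, unrelated to the store choice model): a tiny bandwidth nearly interpolates the noisy sample, while a huge bandwidth collapses toward the global mean.

```python
import math, random

def nw_predict(x0, xs, ys, h):
    """Nadaraya-Watson kernel regression prediction at x0 with bandwidth h."""
    ws = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

random.seed(1)
xs = [i / 20 for i in range(21)]                   # grid on [0, 1]
ys = [x ** 2 + random.gauss(0, 0.05) for x in xs]  # noisy quadratic
tight = nw_predict(0.5, xs, ys, 0.02)   # small h: tracks the noise (high variance)
smooth = nw_predict(0.5, xs, ys, 0.5)   # large h: over-smoothed (high bias)
```

In the limits, a vanishing bandwidth returns the nearest sample point exactly, while an enormous bandwidth returns the sample mean of the $y$'s.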
However, in implementing the IJC algorithm, the nature of this problem differs from standard non-parametric estimation. Unlike a standard estimation problem, where an econometrician cannot control the sample size of the data set, we can control the sample size for our nonparametric regressions by storing/using more past pseudo-value functions (i.e., increasing $N$). This is similar to the advantage of using the standard MCMC method to draw from the posterior distribution: the econometrician can control the number of iterations required to obtain convergence. Thus in practice, we expect that $N$ may need to increase with the number of parameters in the model. As a result, it would also take more time to compute one iteration.
The discussion above suggests that the convergence rate is typically inversely related to the number of dimensions. (Note that this curse of dimensionality problem is different from that of solving a dynamic programming model, where it refers to the size of the state space increasing exponentially with the number of state variables and the number of values for each state variable.) But the situation that we face here is more subtle, for two reasons. First, it is likely that the convergence rate is model specific, as the shape of the likelihood function is also model specific. Second, it should also depend on the sample size of the data. In general, when estimating a well-identified model using a data set with sufficient variation, the posterior variance of the parameters decreases with the sample size. This suggests that once the MCMC converges, the simulated parameter values will move within a small neighborhood of the posterior means. Consequently, the past pseudo-value functions will be evaluated at parameter vectors that are concentrated in a small neighborhood. We expect that this should alleviate the curse of dimensionality problem.
It is worth discussing the impact of $N$ on the estimation results. One implication is that as we increase $N$, older past pseudo-value functions will be used in computing the approximated Emax functions. This may result in slow improvements of the approximated Emax values, and may slow down the speed of MCMC convergence. As we decrease $N$, only more recent and accurate past pseudo-value functions will be used in the Emax approximation. However, since the number of past pseudo-value functions itself becomes smaller, the variance of the approximated Emax values will increase. This may result in a higher standard deviation of the posterior distribution for some parameters. One way of mitigating this trade-off is to set $N$ to be small at the beginning of the IJC algorithm and let $N$ increase during the MCMC iterations. In this way, we can achieve faster convergence and more stable posterior distributions at the same time. Another way to address this issue is to weight the past $N$ pseudo-value functions differently, so that the more recent pseudo-value functions receive higher weights (because they should be more accurate approximations). In one Monte Carlo experiment in the next section, we show some evidence about the impact of $N$ on the estimation results.
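The recency-weighting idea can be sketched by multiplying each kernel weight by a geometric factor in the iteration lag; the `decay` constant here is our own illustrative choice, not a value from the paper:

```python
import math

def recency_kernel_weights(theta, past_thetas, h, decay=0.99):
    """Normalized weights over past draws (ordered oldest-first), combining a
    Gaussian kernel in the parameter with a geometric down-weighting of older
    iterations, so more recent pseudo-value functions receive higher weight."""
    n = len(past_thetas)
    raw = [(decay ** (n - 1 - l)) * math.exp(-0.5 * ((theta - t) / h) ** 2)
           for l, t in enumerate(past_thetas)]
    total = sum(raw)
    return [w / total for w in raw]
```

With equal kernel distances, the most recent pseudo-value function receives the largest weight, and the weights still sum to one.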
An obvious question that will likely come to researchers' minds is: how do we choose $N$ and the bandwidth ($h$)? We believe that any suggested guideline should ensure that the pseudo-value function gives us a good proxy for the true value function. We suggest that researchers check the distance between the pseudo-value function and the true value function during the estimation, and adjust $N$ and $h$ within the iterative process. For instance, researchers can store a large set of past pseudo-value functions (i.e., large $N$), and use only the most recent $N' < N$ of them to do the approximation. This has the advantage that researchers can immediately increase $N'$ if they discover that the approximation is not good enough. Researchers can start the algorithm with a small $N'$ (say $N' = 100$) and an arbitrary bandwidth (say 0.01). Every 1000 iterations, they can compute the means of the MCMC draws, $\bar\theta$, and then compare the distance between the pseudo-value function and the exact value function at $\bar\theta$. If the distance is larger than what the researcher is willing to accept, increase $N'$. Then use the $N'$ past pseudo-value functions to compute summary statistics, and use a standard optimal bandwidth formula (e.g., Silverman's rule of thumb; Silverman 1986, p. 48) to set $h$. Of course, the cost of storing a large number of past pseudo-value functions is that it requires more memory. But again, the cost of memory has been decreasing rapidly over time, so we expect that memory will become less of a constraint in the near future. This suggestion requires solving the DDP model exactly once every 1000 iterations. For complicated DDP models with random coefficients, this could still be computationally costly. But even in this case, one could simply compare the pseudo-value function and the exact value function at a small number of simulated heterogeneous parameter vectors, say 5. This would be equivalent to solving 5 homogeneous DDP models numerically, and should be feasible even for complicated DDP models.
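The bandwidth step can be sketched with one common form of Silverman's rule of thumb, $h = 1.06\,\hat{\sigma}\,n^{-1/5}$, applied to one dimension of the stored draws (Silverman 1986 also gives a more robust variant based on the interquartile range):

```python
import statistics

def silverman_bandwidth(draws):
    """Gaussian rule-of-thumb bandwidth h = 1.06 * sigma_hat * n^(-1/5),
    computed from one parameter's stored MCMC draws."""
    n = len(draws)
    return 1.06 * statistics.stdev(draws) * n ** (-0.2)
```

In the IJC setting, `draws` would be the most recent $N'$ values of one parameter; the bandwidth shrinks slowly as more pseudo-value functions are stored.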
4 Estimation Results
To illustrate how to implement the IJC algorithm and investigate some of its properties,
we conduct three Monte Carlo experiments. For each experiment, the simulated sample
size is 1,000 consumers and 100 periods. We use the Gaussian kernel to weigh the past
pseudo-value functions when approximating the Emax functions. The total number of
MCMC iterations is 10,000, and we report the posterior distributions of parameters based
on the 5,001-10,000 iterations. The sample size is 1,000 consumers for 100 periods. For
all experiments, the following parameters are fixed: S1, S2, p = 1.0, and σp = 0.3.
In the first experiment, we estimate a version of the model without unobserved heterogeneity. When simulating the data, we set $S_1 = 2$, $S_2 = 4$, $\sigma_{G_1} = \sigma_{G_2} = \alpha_1 = \alpha_2 = 0$, $G_1 = 1.0$, $G_2 = 5.0$, $\gamma = -1.0$, and $\beta = 0.6$ or $0.8$. Our goal is to estimate $\alpha_1$, $\alpha_2$, $G_1$, $G_2$, $\gamma$, and $\beta$, treating the other parameters as known. To ensure that $\beta < 1$, we transform it as $\beta = \frac{1}{1+\exp(\phi)}$. For all parameters, a flat prior is used. Moreover, we use a random-walk proposal function. Table 1 summarizes the estimation results, and Figure 3 plots the simulated draws of the parameters for the case of $\beta = 0.8$. The posterior means and standard deviations show that the IJC algorithm is able to recover the true parameter values well. Moreover, it appears that the MCMC draws converge after 2,000 iterations.
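The reparameterization of $\beta$ used above, together with its inverse for initializing $\phi$ at a target value, can be sketched as:

```python
import math

def beta_from_phi(phi):
    """Map an unconstrained phi into beta = 1 / (1 + exp(phi)), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(phi))

def phi_from_beta(beta):
    """Inverse transform: phi = log(1/beta - 1)."""
    return math.log(1.0 / beta - 1.0)
```

Sampling $\phi$ on the whole real line and transforming it guarantees every proposed discount factor stays strictly inside the unit interval.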
In the second experiment, we estimate a version of the model with unobserved heterogeneity. For simplicity, we only allow for consumer heterogeneity in $G_2$ (i.e., $\sigma_{G_1} = 0$). The data are simulated based on the following parameter values: $\alpha_1 = \alpha_2 = 0.0$, $G_1 = 1.0$, $G_2 = 5.0$, $\sigma_{G_1} = 0.0$, $\sigma_{G_2} = 1.0$, $\gamma$, and $\beta = 0.6$ or $0.8$. Again, we transform $\beta$ by the logit formula, i.e., $\beta = \frac{1}{1+\exp(\phi)}$. Our goal is to estimate $\alpha_1$, $\alpha_2$, $G_1$, $G_2$, $\sigma_{G_2}$, $\gamma$, and $\beta$, treating the other parameters as known. For $\alpha_1$, $\alpha_2$, $G_1$, $\gamma$, and $\phi$, we use flat priors. For $G_2$, we use a diffuse normal prior (i.e., setting the standard deviation of the prior to $\infty$). For $\sigma_{G_2}$, we use a diffuse inverted gamma prior, $IG(\nu_0, s_0)$ (i.e., setting $s_0 = 1$, $\nu_0 \to 1$). Table 2 shows the estimation results, and Figure 4 plots the simulated draws of the parameters for $\beta = 0.8$. The IJC algorithm is again able to recover the true parameter values well. The MCMC draws appear to converge after 2,000 iterations for most of the parameters, except $G_1$, which takes about 3,000 iterations to achieve convergence.
To learn more about the potential gain of IJC in terms of computational time, we compute the time per iteration and compare IJC's Bayesian MCMC algorithm with the full-solution-based Bayesian MCMC algorithm, for both the homogeneous and the heterogeneous models. In the full-solution-based Bayesian algorithm, we use 100 simulated draws of prices to integrate out the future price. For each model, we study three cases: $\beta = 0.6$, $0.8$, and $0.98$. Table 3 summarizes the results based on the average computation time. The computation time is based on a C program running on a Linux workstation with an Intel Core 2 Duo E4400 2GHz processor. Note that in the full-solution-based Bayesian algorithm, the computation time increases as the discount factor becomes larger. This is because the number of steps required for convergence of a contraction mapping increases with the discount factor (i.e., the modulus). In the IJC algorithm, however, the computation time is not influenced by the discount factor. In the homogeneous model, the full-solution-based Bayesian algorithm is faster for $\beta = 0.6$ and $0.8$. This is because (i) when $\beta$ is small, solving a contraction mapping to get the exact Emax values is not that costly compared with computing the weighted Emax values based on 1,000 past pseudo-value functions, and (ii) the full-solution-based Bayesian approach does not need to perform step 5 in the homogeneous case and step 7 in the heterogeneous case (in this exercise, we perform step 5 in the homogeneous case and step 7 in the heterogeneous case every time a candidate parameter vector is rejected). However, when $\beta = 0.98$, the IJC algorithm is 40% faster than the full-solution algorithm. In the heterogeneous model, the advantage of the IJC algorithm is much clearer: when $\beta = 0.6$, the IJC algorithm is 50% faster than the full-solution-based Bayesian algorithm; when $\beta = 0.8$, it is about 200% faster; and when $\beta = 0.98$, it is about 3000% faster. In particular, the average computational time per iteration basically remains unchanged in the IJC algorithm. For the full-solution-based method, the computational time per iteration increases sharply with $\beta$ because, roughly speaking, we need to solve the DDP model for each individual. If there are 1,000 individuals, the computational time will then be roughly (time per contraction mapping) × 1,000. For the heterogeneous model with $\beta = 0.98$, it would take about 70 days to run the full-solution-based Bayesian MCMC algorithm for 10,000 iterations (depending on the convergence rate, the number of iterations required for Bayesian estimation could be higher than 10,000). Using IJC's Bayesian MCMC algorithm, it takes less than 2.5 days to obtain 10,000 iterations.
As discussed above, one issue in using the IJC algorithm is how to choose $N$, the number of past pseudo-value functions. In the third Monte Carlo experiment, using the homogeneous model with $\beta = 0.98$, we investigate how changes in $N$ influence the speed of convergence and the posterior distributions. We use a high discount factor here because, as discussed in section 3.3, when the discount factor is large, the number of past pseudo-value functions used for the Emax function approximation becomes more important; changes in $N$ thus have a larger impact on the speed of convergence and the posterior distributions than when the discount factor is small. We simulate the data given the following set of parameter values: $S_1 = 5$, $S_2 = 10$, $\alpha_1 = \alpha_2 = 0$, $G_1 = 1$, $G_2 = 10$, $\gamma = -1$, $p = 1.0$, and $\sigma_p = 0.3$. Our goal is to compare the performance of the IJC algorithm using $N = 100$ and $N = 1000$. Table 4 shows the posterior distributions of the parameters. The results show that the posterior means are very similar in both cases, but the standard deviations for $G_1$ and $G_2$ are smaller for $N = 1000$. This is consistent with our argument in section 3.3: when more pseudo-value functions are used in the approximation, the variance of the approximation becomes smaller. To see how the speed of convergence changes with $N$, we plot the MCMC samplers for $\alpha_1$ and $\alpha_2$ in Figure 5, and for $G_1$ and $G_2$ in Figure 6. When $N = 100$, the speed of convergence is faster, but the paths also fluctuate more. Again, this is consistent with our discussion in section 3.3.
Note that when $\beta = 0.98$, the true parameter values are recovered less precisely, in particular $\alpha_j$ and $G_j$. This is due to the identification problem that we discussed before: when the discount factor is large, it does not matter much when a consumer receives the gift. As a result, $G_j$ simply shifts the choice probabilities, similar to the way that $\alpha_j$ does.
We now turn to discuss how to extend the IJC algorithm to (i) conduct policy experiments, and (ii) allow for a continuous state space. We will also comment on the choice of kernels.
5 Extensions

5.1 Conducting Policy Experiments
The output of the IJC algorithm is the posterior distribution of the model parameters, along with a set of value function (and Emax function) estimates associated with each parameter vector. What if we are interested in a policy experiment that involves changing a policy parameter by a certain percentage (e.g., increasing the cost of entry by $100t$ percent), such that the new parameter vectors do not belong to the support of the posterior distribution? It would appear that the IJC algorithm does not provide solutions of the dynamic programming problem evaluated at those policy parameter vectors. In fact, this limitation applies even if one uses the full-solution-based Bayesian MCMC algorithm.
Here we propose a minor modification of the IJC algorithm so that we can obtain the value functions at the new policy parameters as part of the estimation output as well. Suppose that the researcher is interested in the effect of changing $\theta_i$ to $\tilde\theta_i$, where $\tilde\theta_i = (1+t)\theta_i$ and $\theta_i$ is the $i$th element of the parameter vector $\theta$. The modified procedure needs to store the following additional information: $\{\tilde\theta^{*l}, E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=r-N}^{r-1}$, where $\tilde\theta^{*l}_i = (1+t)\theta^{*l}_i$ and $\tilde\theta^{*l}_{-i} = \theta^{*l}_{-i}$.

Once the MCMC samplers converge, we will have $\{\theta^l\}_{l=1}^L$ as well as $\{\tilde\theta^{*l}, E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=L-N+1}^L$ as the outputs, where $L$ is the total number of MCMC iterations. To do the policy experiment, (i) take the last $M$ draws of $\theta^r$, and set the draws of the policy parameter vector as follows: $\tilde\theta^r_{-i} = \theta^r_{-i}$ and $\tilde\theta^r_i = (1+t)\theta^r_i$; (ii) use $\{E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=L-N+1}^L$ to form an Emax approximation at $\tilde\theta^r$, and then obtain the value function and the choice probabilities at $\tilde\theta^r$ for each $r$.
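Step (ii) — forming the Emax at the scaled policy parameter from the stored pairs — can be sketched as follows (hypothetical storage layout; a Gaussian product kernel stands in for $K_h$):

```python
import math

def policy_emax(s, theta_r, stored, h, i, t):
    """Approximate the Emax at theta_tilde, where element i of theta_r is
    scaled by (1 + t), using stored (theta_tilde_l, emax_l) pairs kept
    alongside the regular IJC output (a hypothetical layout)."""
    theta_tilde = list(theta_r)
    theta_tilde[i] *= (1.0 + t)
    num = den = 0.0
    for theta_l, emax_l in stored:
        w = math.exp(-0.5 * sum(((a - b) / h) ** 2
                                for a, b in zip(theta_tilde, theta_l)))
        num += w * emax_l[s]
        den += w
    return num / den
```

Because the stored grid was itself built at scaled parameters $\tilde\theta^{*l}$, the kernel average is evaluated on the support where the policy counterfactual lives.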
This procedure increases the computational burden of each iteration due to the calculation of the approximated Emax at $\tilde\theta^{*r}$. However, it is relatively straightforward to implement and requires very little extra programming effort. To obtain some insight into how much more time it takes to include the results for a policy experiment, we break down the computation time of each iteration into four components, based on our model with unobserved heterogeneity: (i) the Emax approximation at $\theta^{*r}$; (ii) the likelihood evaluation at $\theta^{*r}$; (iii) the Emax approximation at $\theta^r$ based on an updated set of pseudo-value functions; (iv) the likelihood evaluation at $\theta^r$ based on an updated set of pseudo-value functions. The results are shown in Table 5. Notice that to conduct the policy experiment, we only need to compute the Emax approximation at $\tilde\theta^{*r}$ and store $\{E_p V^l(\cdot, p; \tilde\theta^{*l})\}_{l=r-N}^{r-1}$. The steps used to calculate the Emax functions at $\tilde\theta^{*r}$ are the same as those used to calculate the Emax functions at $\theta^{*r}$. Thus the additional computational time will be about the same as that for step (iii) above, which constitutes about 40-60% of the computation time per iteration. This indicates that conducting a policy experiment with the IJC algorithm would roughly increase the computational time by 40-60%.
Finally, we note that there is a limitation of this procedure: we need to know the
magnitude of the change in the policy parameter before seeing the estimation results.
Sometimes researchers may not be able to determine this until they obtain the parameter
estimates.
5.2 Continuous State Space
The state space of the dynamic store choice model described earlier is the number of
stamps collected, which takes a finite number of values. In many marketing and eco-
nomics applications, however, we have to deal with continuous state variables such as
prices, advertising expenditures, capital stocks, etc. IJC also describes how to extend the
algorithm to combine with the random grid approximation proposed by Rust (1997). To
illustrate how it works, we consider the homogeneous model here.
Consider a modified version of the dynamic store choice model without unobserved consumer heterogeneity. Suppose that the prices set by the two supermarkets follow a first-order Markov process (instead of a process that is iid across time): $f(p'|p; \theta_p)$, where $\theta_p$ is the vector of parameters of the price process. In this setting, the expected value functions in equation (5) are conditional on current prices, $E_{p'}[V(s', p'; \theta)|p]$. In the Rust random grid approximation, we evaluate this expected value function as follows. We randomly sample $M$ grid points, $p^m = (p^m_1, p^m_2)$ for $m = 1, \ldots, M$. Then we evaluate the value functions at each $p^m$ and compute the weighted average of the value functions, where the weights are given by the conditional price distribution.
For each iteration $r$, we can make one draw of prices, $p^r = (p^r_1, p^r_2)$, from a distribution. For example, we can define this distribution as uniform on $[\underline{p}, \bar{p}]^2$, where $\underline{p}$ and $\bar{p}$ are the lowest and highest observed prices, respectively. Then, we compute the pseudo-value function at $p^r$, $V^r(s, p^r; \theta^r)$, for all $s$. Thus, $H^r$ in step 1 of section 3.2.1 needs to be changed to

$$H^r = \{\{\theta^{*l}, p^l, V^l(\cdot, p^l; \theta^{*l})\}_{l=r-N}^{r-1},\ \rho^{r-1}(\theta^{r-1})\}.$$

The expected value function given $s'$, $p$, and $\theta^r$ is then approximated as

$$E^r_{p'}[V(s', p'; \theta^r)|p] = \sum_{l=r-N}^{r-1} V^l(s', p^l; \theta^{*l})\ \frac{K_h(\theta^r - \theta^{*l})\, f(p^l|p; \theta_p)}{\sum_{k=r-N}^{r-1} K_h(\theta^r - \theta^{*k})\, f(p^k|p; \theta_p)}. \qquad (8)$$
Unlike the Rust random grid approximation, which fixes the number of grid points throughout the estimation, the random grid points here change at each MCMC iteration. In addition, the total number of random grid points can be made arbitrarily large by increasing $N$.
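Equation (8) can be sketched as follows; `history` is a hypothetical list of stored $(\theta^{*l}, p^l, V^l)$ triples, and `f_price` stands in for the Markov price density $f(p'|p; \theta_p)$:

```python
import math

def expected_value_markov(s_next, p, theta_r, history, h, f_price):
    """Approximate E_{p'}[V(s', p'; theta_r) | p] as a weighted average of past
    pseudo-value functions, with weights combining a Gaussian kernel in the
    parameters and the conditional price density f(p_l | p)."""
    num = den = 0.0
    for theta_l, p_l, v_l in history:
        w = math.exp(-0.5 * sum(((a - b) / h) ** 2
                                for a, b in zip(theta_r, theta_l))) * f_price(p_l, p)
        num += w * v_l[s_next]
        den += w
    return num / den
```

The price density factor does the work of the random grid: past draws that are likely successors of the current price receive large weights, so no separate interpolation over a fixed price grid is needed.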
The procedure for obtaining the pseudo-value function in step 4 of section 3.2.1 needs to be modified slightly. We store $V^r(s, p^r; \theta^{*r})$ instead of $E_p V^r(s, p; \theta^{*r})$. Specifically, the pseudo-value function at $p^r$ (and $\theta^{*r}$) is computed as follows. For each $s$,

$$V^r(s, p^r; \theta^{*r}) = \log\left(\exp(V^r_0(s; \theta^{*r})) + \exp(V^r_1(s, p^r_1; \theta^{*r})) + \exp(V^r_2(s, p^r_2; \theta^{*r}))\right),$$

where

$$V^r_j(s, p^r_j; \theta^{*r}) = \begin{cases} \alpha_j - \gamma p^r_j + G_j + \beta E^r_{p'}[V(s', p'; \theta^{*r})|p^r] & \text{if } s_j = S_j - 1, \\ \alpha_j - \gamma p^r_j + \beta E^r_{p'}[V(s', p'; \theta^{*r})|p^r] & \text{otherwise}, \end{cases}$$

$$V^r_0(s; \theta^{*r}) = \beta E^r_{p'}[V(s', p'; \theta^{*r})|p^r].$$
The approximated Emax functions above are computed by equation (8).
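The log-sum-of-exponentials in the pseudo-value function above is usually computed with the maximum subtracted to avoid numerical overflow; a generic sketch:

```python
import math

def pseudo_value(v0, v1, v2):
    """Closed-form Emax over the three alternatives under extreme-value errors:
    log(exp(v0) + exp(v1) + exp(v2)), stabilized by factoring out the max."""
    m = max(v0, v1, v2)
    return m + math.log(math.exp(v0 - m) + math.exp(v1 - m) + math.exp(v2 - m))
```

Without the max-subtraction, a large alternative-specific value would overflow `exp`; with it, the largest exponent is always zero.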
Note that if we simply apply the Rust random grid approximation with M grid points
in the IJC algorithm, we need to compute the pseudo-value functions at M grid points
in each iteration. Also, the integration with respect to prices requires us to first compute
the approximated value function at each grid point and then take the weighted average
of the approximated value functions. The computational advantage of the IJC random
grid algorithm described above comes from the fact that we only need to compute the
pseudo-value function at one grid point, pr, in each iteration, and the integration with
respect to prices can be done without approximating the value functions at a set of grid
points separately.
5.3 Choice of Kernels
It should be noted that there are many kernels that one could use in forming a non-
parametric approximation for the Emax function. IJC discuss their method in terms of the
Gaussian kernel. Norets (2008) extends IJC's method by approximating the Emax function using the past value functions evaluated at the "nearest neighbors," and by allowing the error
terms to be serially correlated. At this point, the relative performances of different kernels
in this setting are still largely unknown. It is possible that for models with certain features,
the Gaussian kernel performs better than other kernels in approximating the pseudo-value
function, while other kernels may outperform the Gaussian kernel for models with other
features. More research is needed to document the pros and cons of different kernels, and
provide guidance in the choice of kernel when implementing the IJC method.
6 Conclusion
In this paper, we discuss how to implement the IJC method using a dynamic store choice model. For illustration purposes, the specification of the model is relatively simple. We believe that this new method is quite promising for estimating DDP models. Osborne (2007) has successfully applied this method to estimate a much more detailed consumer learning model. The IJC method allows him to incorporate much more general unobserved consumer heterogeneity than the previous literature, and to draw inferences on the relative importance of switching costs, consumer learning, and consumer heterogeneity in explaining the persistent purchase behavior of customers observed in scanner panel data. Ching et al. (2009) have also successfully estimated a learning and forgetting model in which consumers are forward-looking.
Bayesian inference has allowed researchers and practitioners to develop more realistic
static choice models in the last two decades. It is our hope that the new method presented
here and its extensions would allow us to take another step to develop more realistic
dynamic choice models and ease the burden of estimating them in the near future.
References
Ackerberg, Daniel A. 2001. A New Use of Importance Sampling to Reduce Computational
Burden in Simulation Estimation. Working paper, Department of Economics, UCLA.
Ackerberg, Daniel A. 2003. Advertising, Learning, and Consumer Choice in Experience
Good Markets: An Empirical Examination. International Economic Review 44(3)
1007–1040.
Aguirregabiria, Victor, Pedro Mira. 2002. Swapping the Nested Fixed Point Algorithm:
A Class of Estimators for Discrete Markov Decision Models. Econometrica 70(4) 1519–
1543.
Albert, James H., Siddhartha Chib. 1993. Bayesian Analysis of Binary and Polychotomous
Response Data. Journal of the American Statistical Association 88 669–679.
Allenby, Greg M. 1994. An Introduction to Hierarchical Bayesian Modeling. Tutorial
Notes, Advanced Research Techniques Forum, American Marketing Association.
Allenby, Greg M., Peter J. Lenk. 1994. Modeling Household Purchase Behavior with
Logistic Normal Regression. Journal of the American Statistical Association 89 1218–
1231.
Brown, Meta, Christopher J. Flinn. 2006. Investment in Child Quality Over Marital
States. Working paper, Department of Economics, New York University.
Cerny, V. 1985. Thermodynamical Approach to the Travelling Salesman Problem: An
Efficient Simulation Algorithm. Journal of Optimization Theory and Applications 45(1)
41–51.
Crawford, Gregory S., Matthew Shum. 2005. Uncertainty and Learning in Pharmaceutical
Demand. Econometrica 73(4) 1137–1174.
Diermeier, Daniel, Michael P. Keane, Antonio M. Merlo. 2005. A Political Economy Model
of Congressional Careers. American Economic Review 95 347–373.
Erdem, Tulin, Susumu Imai, Michael P. Keane. 2003. Brand and Quality Choice Dynamics
under Price Uncertainty. Quantitative Marketing and Economics 1(1) 5–64.
Erdem, Tulin, Michael P. Keane. 1996. Decision Making under Uncertainty: Capturing
Dynamic Brand Choice Processes in Turbulent Consumer Goods Markets. Marketing
Science 15(1) 1–20.
Geweke, John F., Michael P. Keane. 2002. Bayesian Inference for Dynamic Discrete
Choice Models Without the Need for Dynamic Programming. Mariano, Schuermann,
Weeks, eds., Simulation Based Inference and Econometrics: Methods and Applications .
Cambridge University Press, Cambridge, UK.
Gonul, Fusun, Kannan Srinivasan. 1996. Estimating the Impact of Consumer Expec-
tations of Coupons on Purchase Behavior: A Dynamic Structural Model. Marketing
Science 15(3) 262–279.
Hartmann, Wesley R., V. Brian Viard. 2008. Do Frequency Reward Programs Create
Switching Costs? A Dynamic Structural Analysis of Demand in a Reward Program.
Quantitative Marketing and Economics 6(2) 109–137.
Hendel, Igal, Aviv Nevo. 2006. Measuring the Implications of Sales and Consumer Inven-
tory Behavior. Econometrica 74(6) 1637–1673.
Hitsch, Gunter. 2006. An Empirical Model of Optimal Dynamic Product Launch and
Exit Under Demand Uncertainty. Marketing Science 25(1) 25–50.
Hotz, Joseph V., Robert Miller. 1993. Conditional Choice Probabilities and the Estimation
of Dynamic Models. Review of Economic Studies 60(3) 497–529.
Imai, Susumu, Neelam Jain, Andrew Ching. 2008. Bayesian Estimation of Dynamic
Discrete Choice Models. Conditionally accepted at Econometrica. Available at SSRN:
http://ssrn.com/abstract=1118130.
Imai, Susumu, Kala Krishna. 2004. Employment, Deterrence and Crime in a Dynamic
Model. International Economic Review 45(3) 845–872.
Keane, Michael P., Kenneth I. Wolpin. 1994. The Solution and Estimation of Discrete
Choice Dynamic Programming Models by Simulation and Interpolation: Monte Carlo
Evidence. Review of Economics and Statistics 74(4) 648–672.
Keane, Michael P., Kenneth I. Wolpin. 1997. The Career Decisions of Young Men. Journal
of Political Economy 105 473–521.
Kirkpatrick, S., C.D. Gelatt, M.P. Vecchi. 1983. Optimization by Simulated Annealing.
Science 220 671–680.
Lancaster, Tony. 1997. Exact Structural Inference in Optimal Job Search Models. Journal
of Business and Economic Statistics 15(2) 165–179.
McCulloch, Robert, Peter E. Rossi. 1994. An Exact Likelihood Analysis of the Multino-
mial Probit Model. Journal of Econometrics 64 207–240.
Norets, Andriy. 2008. Inference in Dynamic Discrete Choice Models with Serially Corre-
lated Unobserved State Variables.
Osborne, Matthew. 2007. Consumer Learning, Switching Costs, and Heterogeneity: A
Structural Examination. Working paper, U.S. Department of Justice.
Rossi, Peter E., Greg M. Allenby. 1999. Marketing Models of Consumer Heterogeneity.
Journal of Econometrics 89 57–78.
Rossi, Peter E., Robert McCulloch, Greg M. Allenby. 1996. The Value of Purchase History
Data in Target Marketing. Marketing Science 15 321–340.
Rust, John. 1987. Optimal Replacement of GMC Bus Engines: An Empirical Model of
Harold Zurcher. Econometrica 55(5) 999–1033.
Rust, John. 1997. Using Randomization to Break the Curse of Dimensionality. Econo-
metrica 65(3) 487–516.
Silverman, Bernard W. 1986. Density Estimation for Statistics and Data Analysis. Chapman
and Hall, London, UK.
Song, Inseong, Pradeep K. Chintagunta. 2003. A Micromodel of New Product Adop-
tion with Heterogeneous and Forward Looking Consumers: Application to the Digital
Camera Category. Quantitative Marketing and Economics 1(4) 371–407.
Sun, Baohong. 2005. Promotion Effect on Endogenous Consumption. Marketing Science
24(3) 430–443.
Train, Kenneth E. 2003. Discrete Choice Methods with Simulation. Cambridge University
Press, Cambridge, UK.
Yang, Botao, Andrew Ching. 2008. Dynamics of Consumer Adoption Decisions of Finan-
cial Innovation: The Case of ATM Cards in Italy. Working paper, Rotman School of
Management, University of Toronto.
Table 1: Estimation Results: Homogeneous Model

                                           β = 0.6           β = 0.8
parameter                      TRUE     mean     sd       mean     sd
α1 (intercept for store 1)      0.0    -0.001   0.019    -0.030   0.022
α2 (intercept for store 2)      0.0    -0.002   0.019    -0.018   0.028
G1 (reward for store 1)         1.0     0.998   0.017     1.052   0.021
G2 (reward for store 2)         5.0     5.032   0.048     5.088   0.085
γ (price coefficient)          -1.0    -0.999   0.016    -0.996   0.019
β (discount factor)         0.6/0.8     0.601   0.008     0.800   0.010

Notes:
Sample size: 1,000 consumers for 100 periods.
Fixed parameters: S1 = 2, S2 = 4, p = 1.0, σp = 0.3, σGj = 0 for j = 1, 2.
Tuning parameters: N = 1,000 (number of past pseudo-value functions used for emax approximations), h = 0.01 (bandwidth).
Table 2: Estimation Results: Heterogeneous Model

                                           β = 0.6           β = 0.8
parameter                      TRUE     mean     sd       mean     sd
α1 (intercept for store 1)      0.0    -0.005   0.019    -0.022   0.022
α2 (intercept for store 2)      0.0     0.010   0.021     0.005   0.037
G1 (reward for store 1)         1.0     1.017   0.017     1.010   0.019
G2 (reward for store 2)         5.0     5.066   0.065     4.945   0.130
σG2 (sd of G2)                  1.0     1.034   0.046     1.029   0.040
γ (price coefficient)          -1.0    -1.004   0.016    -0.985   0.019
β (discount factor)         0.6/0.8     0.595   0.005     0.798   0.006

Notes:
Sample size: 1,000 consumers for 100 periods.
Fixed parameters: S1 = 2, S2 = 4, p = 1.0, σp = 0.3, σG1 = 0.
Tuning parameters: N = 1,000 (number of past pseudo-value functions used for emax approximations), h = 0.01 (bandwidth).
Table 3: Computation Time Per MCMC Iteration (in seconds)

                                    Homogeneous Model            Heterogeneous Model
algorithm                      β = 0.6  β = 0.8  β = 0.98    β = 0.6  β = 0.8  β = 0.98
Full solution based Bayesian    0.782    0.807    1.410      31.526   65.380   613.26
IJC with N = 1000               1.071    1.049    1.006      19.300   19.599    18.387

Notes:
Sample size: 1,000 consumers for 100 periods.
Number of state points: 8 (S1 = 2, S2 = 4).
Number of parameters: 6 in homogeneous model; 7 in heterogeneous model.
Table 4: The Impact of N

                                          N = 100           N = 1000
parameter                      TRUE     mean     sd       mean     sd
α1 (intercept for store 1)      0.0    -0.049   0.020    -0.061   0.020
α2 (intercept for store 2)      0.0     0.032   0.019     0.022   0.019
G1 (reward for store 1)         1.0     1.234   0.034     1.246   0.021
G2 (reward for store 2)        10.0     9.740   0.063     9.751   0.028
γ (price coefficient)          -1.0    -1.000   0.018    -0.991   0.018

Notes:
Sample size: 1,000 consumers for 100 periods.
Fixed parameters: S1 = 5, S2 = 10, p = 1.0, σp = 0.3, σGj = 0 for j = 1, 2, β = 0.98.
Tuning parameters: h = 0.01 (bandwidth).
Table 5: Breakdown of Computation Time Per MCMC Iteration (in seconds) for IJC Algorithm (β = 0.8)

Heterogeneous Model
computation                                        N = 100   N = 500   N = 1000
Emax approximation at θ*r (steps 4(b) & 5(b))      1.1502    5.6808    11.3149
Likelihood value at θ*r (steps 4(b) & 5(b))        0.5229    0.5174     0.5264
Emax approximation at θr (step 7)                  0.7180    3.5097     6.9819
Likelihood value at θr (step 7)                    0.3280    0.3223     0.3271
Computation time per iteration                     2.7724   10.1219    19.2253

Notes:
Sample size: 1,000 consumers for 100 periods.
Number of state points: 8 (S1 = 2, S2 = 4).
Number of parameters: 7.
Steps indicated in the table are those in Section 3.2.2.
Step 7 was performed every time a candidate parameter value was rejected.
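As the table notes indicate, the emax approximation in the IJC algorithm averages N past pseudo-value functions using kernel weights with bandwidth h, so its cost grows roughly linearly in N, consistent with the timings above. A minimal sketch of such a kernel-weighted average follows; the function name, array shapes, and the Gaussian kernel are our own illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def emax_approx(theta, past_thetas, past_values, h=0.01):
    """Kernel-weighted average of past pseudo-value functions.

    theta:       (K,)   current candidate parameter draw
    past_thetas: (N, K) the N most recent parameter draws
    past_values: (N, S) pseudo-value functions evaluated at S state points
    h:           kernel bandwidth
    """
    # Gaussian kernel weights based on squared distance to the current draw
    dist2 = np.sum((past_thetas - theta) ** 2, axis=1)
    w = np.exp(-dist2 / (2.0 * h ** 2))
    w /= w.sum()
    # Weighted average across the N past iterations, one value per state point
    return w @ past_values
```

Past draws close to the current candidate receive nearly all the weight, which is how information from earlier iterations is reused in place of a full contraction-mapping solve at each parameter value.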
Figure 1: Choice probabilities across states for different discount factors
[Line plot. x-axis: no. of stamps collected (s), 0–4; y-axis: purchase probability, 0–1; one curve per discount factor: β = 0, 0.5, 0.75, 0.9, 0.999.]
Figure 2: Choice probabilities for different discount factors across states
[Line plot. x-axis: discount factor (β), 0–1; y-axis: purchase probability, 0–1; one curve per state: s = 0, 1, 2, 3, 4.]
Figure 3: MCMC plots: Homogeneous Model with β = 0.8
[Six trace plots over 10,000 MCMC iterations, one per parameter: α1 (true value = 0.0), α2 (true value = 0.0), G1 (true value = 1.0), G2 (true value = 5.0), γ (true value = -1.0), β (true value = 0.8).]
Figure 4: MCMC plots: Heterogeneous Model with β = 0.8
[Seven trace plots over 10,000 MCMC iterations, one per parameter: α1 (true value = 0.0), α2 (true value = 0.0), G1 (true value = 1.0), G2 (true value = 5.0), γ (true value = -1.0), β (true value = 0.8), σG2 (true value = 1.0).]
Figure 5: MCMC plots: Impact of N on α1 and α2 when β = 0.98
[Trace plots over 30,000 MCMC iterations for α1 (true value = 0.0) and α2 (true value = 0.0), comparing N = 100 (left column) and N = 1000 (right column).]
Figure 6: MCMC plots: Impact of N on G1 and G2 when β = 0.98
[Trace plots over 30,000 MCMC iterations for G1 (true value = 1.0) and G2 (true value = 5.0), comparing N = 100 (left column) and N = 1000 (right column).]