
Data Driven Decision Models

Vidyadhar G. Kulkarni
Department of Statistics and Operations Research
University of North Carolina
Chapel Hill, NC 27599-3260

October 25, 2018


Contents

1 Data-Driven Inventory Management
    1.1 The Basic Model
    1.2 Unknown Parametric F, Observable Demand
        1.2.1 Maximum Likelihood
        1.2.2 Bayesian Updates
        1.2.3 Operational Statistics
    1.3 Unknown Non-parametric F, Observable Demand
        1.3.1 Maximin Criterion
        1.3.2 Minimax Criterion
        1.3.3 Empirical Distribution
        1.3.4 Operational Statistics
    1.4 Unknown Parametric F, Censored Demand
        1.4.1 Maximum Likelihood
        1.4.2 Bayesian Updates
        1.4.3 Optimal Bayes' Policy
    1.5 Unknown Non-parametric F, Censored Demands
        1.5.1 Kaplan-Meier Estimator
        1.5.2 Stochastic Approximation
        1.5.3 CAVE Algorithm
        1.5.4 AIM Algorithm
    1.6 Data Driven Approach
        1.6.1 Data Driven Approach: Linear Programming
        1.6.2 Demands with Covariates: Linear Regression
        1.6.3 Demands with Covariates: Linear Programming
        1.6.4 Machine Learning Algorithms
        1.6.5 Deep Learning

2 Data-Driven Offer Optimization
    2.1 The Basic Model
        2.1.1 Non-stationary Setting
    2.2 Unknown Parametric F: Bayesian Approach
        2.2.1 Normal Distribution with Unknown Mean
        2.2.2 Uniform Distribution with Unknown Upper Bound
    2.3 Unknown Parametric F: Maximin Criterion
        2.3.1 Uniform Offers
        2.3.2 Normal Offers
    2.4 Unknown Non-parametric F
        2.4.1 Empirical Distribution
        2.4.2 Scarf's Inequality
        2.4.3 Relative Rank

3 Data-Driven Staffing of Service Systems
    3.1 The Basic Model
        3.1.1 M/M/1 queue
        3.1.2 M/M/s queue
        3.1.3 M/M/s/s queue
        3.1.4 M(t)/M/∞ queue
        3.1.5 M/M/s with abandonment
    3.2 Staffing in Single-class Many-server Service Systems
        3.2.1 Halfin-Whitt Staffing Rule
        3.2.2 Garnett-Mandelbaum-Reiman Staffing Rule
    3.3 Multi-class Multi-pool Service Systems
    3.4 Time Varying Staffing
    3.5 Bayesian Staffing
    3.6 Data Driven Staffing
        3.6.1 Stochastic Programming
        3.6.2 Robust Optimization
        3.6.3 Bassamboo-Zeevi Model
    3.7 Data Driven Staffing: A Case Study
        3.7.1 Arrival Data
        3.7.2 Decision Model
        3.7.3 Performance

4 Data-Driven Revenue Management and Dynamic Pricing
    4.1 Yield Management: Fixed Pricing
        4.1.1 Basic Model
        4.1.2 Stochastic Approximation
    4.2 Finite Inventory, Dynamic Pricing
        4.2.1 A Simple Two-period Model
        4.2.2 Discrete Time Model
        4.2.3 Continuous Time Model
    4.3 Dynamic Pricing with Unknown Demand Function
        4.3.1 Discrete Time Model: Linear Regression
        4.3.2 An n-period Model
        4.3.3 Minimax Regret
        4.3.4 Discrete Time Model: Bayesian Learning
    4.4 Unlimited Inventory, Dynamic Pricing
        4.4.1 Linear Regression
        4.4.2 Bayesian Learning
        4.4.3 Minimax Regret
        4.4.4 Neural Network Models
    4.5 Robust Optimization

5 Data Driven Medical Decisions
    5.1 Statistical Models
        5.1.1 Linear Regression
        5.1.2 Logistic Regression
        5.1.3 Regression Trees
        5.1.4 Classification Trees
        5.1.5 Random Forests
        5.1.6 Principal Component Analysis and Singular Value Decomposition
        5.1.7 K-means Clustering
        5.1.8 Hierarchical Clustering
        5.1.9 Q-Learning
    5.2 Breast Cancer Screening
    5.3 Epidemic Management
        5.3.1 SIR Model
        5.3.2 Control of Epidemics
    5.4 Precision Medicine
        5.4.1 Patient-specific Treatment Rules
        5.4.2 Patient-Specific Dosage Rules
        5.4.3 Dynamic Control for Diabetes

6 Other Topics
    6.1 Recommendation Systems
    6.2 Rating Systems
    6.3 Online Advertising
    6.4 Predictive Maintenance
    6.5 Sports Analytics


Chapter 1

Data-Driven Inventory Management

1.1 The Basic Model

We begin with a basic newsvendor model from Hadley and Whitin [48]. A newsvendor buys x newspapers in the morning from the dealer. The demand D is a random variable with a known distribution F(y) = P(D ≤ y), y ≥ 0. The newsvendor has to select x before knowing the actual value of D. The question is: what is the optimal order quantity x∗? Clearly, this depends on the objective function that the newsvendor wants to optimize. We present several possible objective functions below.

Maximize Expected Profit

The most common objective is to maximize the expected profit. Suppose the purchase cost is c per paper, and the selling price is s > c per paper during the rest of the day. The salvage value of any leftover papers is zero. There is no cost to the lost sales other than the lost revenue. (These assumptions can be relaxed.) The (random) profit for the newsvendor is given by

\[ \phi(x,D) = s\min(x,D) - cx = sD - cx - s(D-x)^+, \quad x \ge 0. \tag{1.1} \]

The expected profit is given by

\[ \phi(x) = E(\phi(x,D)) = s\int_0^x (1-F(u))\,du - cx. \tag{1.2} \]

This is a concave function of x. To describe where it achieves its maximum, we need the following notation: for a ∈ [0, 1], if there is a y ≥ 0 such that F(y) = a, define

\[ F^{-1}(a) = \{y \ge 0 : F(y) = a\}. \]

If there is no y ≥ 0 such that F(y) = a, define

\[ F^{-1}(a) = \sup\{y \ge 0 : F(y) \le a\}. \]


Note that if D is a continuous random variable, F^{-1}(a) is a singleton. Otherwise, it is an interval, which may be a singleton. We can show that φ(x) is maximized at

\[ x^* \in F^{-1}\!\left(\frac{s-c}{s}\right). \tag{1.3} \]

Thus the optimum order quantity may not be unique. The newsvendor's profit is maximized if he orders x∗ newspapers in the morning.

For example, when D is an Exp(µ) random variable, we have

\[ \phi(x) = s(1 - e^{-\mu x})/\mu - cx. \tag{1.4} \]

In this case F^{-1} is a singleton, and we get

\[ x^* = \frac{1}{\mu}\ln\!\left(\frac{s}{c}\right) = E(D)\ln\!\left(\frac{s}{c}\right). \tag{1.5} \]

The maximum profit is given by

\[ \phi^* = \phi(x^*) = E(D)\bigl(s - c(1 + \ln(s/c))\bigr). \]
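The closed form in Equation 1.5 is easy to check numerically. The sketch below (the values µ = 1, s = 10, c = 5 are illustrative, not prescribed by the text) compares the formula against a brute-force maximization of φ(x) from Equation 1.4:

```python
from math import exp, log

# Illustrative parameters (not from the text): Exp(1) demand, s = 10, c = 5.
mu, s, c = 1.0, 10.0, 5.0

def expected_profit(x):
    # phi(x) = s(1 - e^{-mu x})/mu - c x   (Equation 1.4)
    return s * (1 - exp(-mu * x)) / mu - c * x

x_star = (1 / mu) * log(s / c)               # Equation 1.5
grid = [i / 10000 for i in range(50000)]     # brute-force search over [0, 5)
x_num = max(grid, key=expected_profit)
print(x_star, x_num)  # both are close to ln 2
```

The grid maximizer agrees with the critical-fractile formula to the grid resolution.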

Similarly, when D is uniformly distributed over [0, a], we have

\[ \phi(x) = (s-c)x - sx^2/(2a). \tag{1.6} \]

This is maximized at

\[ x^* = a\,\frac{s-c}{s} = 2E(D)\,\frac{s-c}{s}. \tag{1.7} \]

The maximum profit in this case is given by

\[ \phi^* = E(D)\,\frac{(s-c)^2}{s}. \]

Minimize Expected Cost

An alternative to maximizing profits is to minimize costs. We assume that every unsold paper costs h dollars (holding cost), and every lost sale costs b dollars (back-order cost). We can think of h as the cost of over-stocking, and b as the cost of under-stocking. The (random) cost is then given by

\[ \chi(x,D) = h(x-D)^+ + b(D-x)^+. \tag{1.8} \]

We can show that the expected cost is

\[ \chi(x) = E(\chi(x,D)) = bE(D) + hx - (h+b)\int_0^x (1-F(u))\,du. \]

As seen earlier, this is minimized at

\[ x^* \in F^{-1}\!\left(\frac{b}{h+b}\right). \]
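The cost criterion has the same critical-fractile structure, with b/(h+b) in place of (s−c)/s. A small sketch (the values µ, h, b are illustrative, not from the text) verifies this for Exp(µ) demand, where F^{-1}(b/(h+b)) = (1/µ) ln((h+b)/h):

```python
from math import exp, log

# Illustrative parameters (not from the text): Exp(1) demand, h = 2, b = 6.
mu, h, b = 1.0, 2.0, 6.0

def expected_cost(x):
    # chi(x) = b E(D) + h x - (h+b) * integral_0^x (1 - F(u)) du,
    # where the integral equals (1 - e^{-mu x})/mu for Exp(mu) demand.
    return b / mu + h * x - (h + b) * (1 - exp(-mu * x)) / mu

x_star = (1 / mu) * log((h + b) / h)         # F^{-1}(b/(h+b)) for Exp(mu)
grid = [i / 10000 for i in range(50000)]
x_num = min(grid, key=expected_cost)
print(x_star, x_num)
```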


Maximize Tail Probability

Another common objective is to maximize P(φ(x,D) ≥ z), where z ≥ 0 is a given target profit level. Let S = min(x,D) be the actual sales. We have P(S ≥ y|x) = 1 − F(y) if y ≤ x and P(S ≥ y|x) = 0 if y > x. Hence

\[ P(\phi(x,D) \ge z) = \begin{cases} 1 - F\!\left(\dfrac{z+cx}{s}\right) & z \le (s-c)x, \\[1.5ex] 0 & z > (s-c)x. \end{cases} \]

Thus the optimal order quantity is

\[ x^* = z/(s-c). \]

This is independent of F! This is a consequence of having no penalty other than lost sales. See Lau [64] for further details.

Mean Variance Tradeoff

Let k ≥ 0 be a given constant and σ(Z) be the standard deviation of a random variable Z. Here we consider the objective function

\[ E(\phi(x,D)) - k\,\sigma(\phi(x,D)). \tag{1.9} \]

Thus the aim is to achieve a mean-variance tradeoff. Under this criterion, the newsvendor is willing to accept a lower expected profit if it comes with reduced variability. Another way of interpreting the above objective function is to think that the newsvendor is trying to solve the optimization problem:

    Maximize  E(φ(x,D))
    Subject to:  σ(φ(x,D)) ≤ β,

where β is a fixed constant. Or, alternatively,

    Minimize  σ(φ(x,D))
    Subject to:  E(φ(x,D)) ≥ α,

where α is a given constant. The objective function in Equation 1.9 is the Lagrangian formulation of the above constrained optimization problems.

We illustrate the concept with two numerical examples. The first example assumes iid exponential demands with µ = 1, s = 10, c = 5. Figure 1.1 shows the graph of the mean versus the standard deviation of the profit. The point (0, 0) corresponds to the order quantity x = 0. As x increases, both the mean and the standard deviation increase until the maximum mean 1.53 is reached at x = 0.69. As x increases beyond that, the expected profit decreases, but the standard deviation keeps increasing. The second example assumes iid uniform demands with a = 2, s = 10, c = 5. Figure 1.2 shows the corresponding graph for the uniform case. Again the point (0, 0) corresponds to x = 0; both the mean and the standard deviation increase until the maximum mean 2.5 is reached at x = 1, after which the expected profit decreases while the standard deviation keeps increasing. In both figures, the lower part of the curve is the efficient frontier.

[Figure omitted: mean profit (horizontal axis, 0 to 1.6) versus standard deviation of profit (vertical axis, 0 to 6).]

Figure 1.1: Mean-Variance Tradeoff for Exponential Demands.

[Figure omitted: mean profit (horizontal axis, 0 to 2.5) versus standard deviation of profit (vertical axis, 0 to 6).]

Figure 1.2: Mean-Variance Tradeoff for Uniform Demands.

The objective function in Equation 1.9 is maximized at a point on the efficient frontier where the slope of the curve is 1/k.
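The frontier of the first example can be traced by Monte Carlo. The sketch below uses the example's values µ = 1, s = 10, c = 5 (the simulation itself is an illustration) to estimate the mean and standard deviation of the profit at a few order quantities; the mean peaks near 1.53 at x ≈ 0.69, as stated above:

```python
import random

random.seed(0)
s, c = 10.0, 5.0
demands = [random.expovariate(1.0) for _ in range(200_000)]  # Exp(1) demand

def mean_std_profit(x):
    # Sample mean and standard deviation of phi(x, D) = s*min(x, D) - c*x.
    profits = [s * min(x, d) - c * x for d in demands]
    m = sum(profits) / len(profits)
    v = sum((p - m) ** 2 for p in profits) / len(profits)
    return m, v ** 0.5

for x in (0.0, 0.3, 0.69, 1.5):
    m, sd = mean_std_profit(x)
    print(f"x={x:4.2f}  mean={m:5.2f}  std={sd:4.2f}")
```

Plotting these (mean, std) pairs over a fine grid of x reproduces Figure 1.1.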

Maximize Expected Utility

Let U : [0,∞) → R be an increasing concave function representing the utility function of a risk-averse newsvendor. A risk-neutral utility function is linear. A general objective is to choose the order quantity to maximize the expected utility of the net profit. We first introduce the concept of concave ordering of random variables. Let F and G be the distributions of non-negative random variables X and Y respectively. We say that X is less than (or equal to) Y in concave ordering if

\[ E(U(X)) \le E(U(Y)) \]

for all increasing concave functions U. That is,

\[ \int_0^\infty U(t)\,dF(t) \le \int_0^\infty U(t)\,dG(t) \]

for all increasing concave functions U. One can show that F is less than G in concave ordering if and only if

\[ \int_0^x (1-F(t))\,dt \le \int_0^x (1-G(t))\,dt \]

for all x ≥ 0. Now let F_α and G_α be the α-th quantiles of X and Y respectively. That is,

\[ F_\alpha = \inf\{t \ge 0 : F(t) \ge \alpha\}. \]

Then one can show that another equivalent condition for concave ordering (see Bertsimas and Thiele [15]) is the following:

\[ E(X \mid X \le F_\alpha) \le E(Y \mid Y \le G_\alpha) \]

for all α ∈ [0, 1].

In the context of the newsvendor problem, the net profit is given by φ(x,D), which is a random variable that depends on x. Thus we want to find an x that maximizes E(U(φ(x,D))). If this is to work for all increasing concave U's, such an x must maximize E(φ(x,D) | φ(x,D) ≤ φ_α(x)) for all α ∈ [0, 1], where φ_α(x) is the α-th quantile of φ(x,D). We can see that

\[ \phi_\alpha(x) = s\min(F_\alpha, x) - cx. \]

Using this we get

\[ g_\alpha(x) = E(\phi(x,D) \mid \phi(x,D) \le \phi_\alpha(x)) = s\int_0^{\min(x, F_\alpha)} \left(1 - F(y)/\alpha\right)dy - cx. \tag{1.10} \]

Let x∗(α) be the value of x that maximizes g_α. It is easy to see that

\[ x^*(\alpha) \in F^{-1}\!\left(\alpha\,\frac{s-c}{s}\right). \tag{1.11} \]

Assignment: Give the details of the derivation of the above three equations.


Note that x∗(α) maximizes the expected utility for the utility function U(x) = min(x, F_α). Also, this optimal order size is less than the x∗ given in Equation 1.3. Thus the risk-averse newsvendor orders less than the risk-neutral newsvendor, which makes intuitive sense. Clearly, x∗(α) is a function of α. Hence there is no single order quantity that maximizes the expected utility for all increasing concave utility functions. However, this will provide us with an important tool to incorporate data when the demand distribution is unknown.

Thus the optimal policy is easy to implement if F is known. However, in practice, F is not known. What do we do in that case?

1.2 Unknown Parametric F , Observable Demand

Let D_i be the demand on the ith day, and suppose {D_i, i ≥ 1} is a sequence of iid random variables with common cdf F. We assume that F is unknown, but belongs to a parametric family with parameter θ, that is,

\[ F(y) = F(y|\theta). \]

The demands are fully observable. On day n + 1 the newsvendor has to choose the order quantity x_{n+1} based upon the data observed so far, namely D^{(n)} = [D_1, D_2, \cdots, D_n]. In this section we consider several ways of doing this.

1.2.1 Maximum Likelihood.

Suppose T(D^{(n)}) is a sufficient statistic for θ. Let θ_n be the maximum likelihood estimate (MLE) of θ based on the sufficient statistic T(D^{(n)}), and let

\[ F_n(y) = F(y|\theta_n). \]

Then one can compute the order quantity x_{n+1} using F_n in place of F in Equation 1.3. We know that θ_n is consistent; hence, under mild conditions, x_n converges to x∗ as n → ∞.

For example, when the demands are iid Exp(µ), the sufficient statistic is

\[ T(D^{(n)}) = \sum_{i=1}^n D_i, \]

with the MLE of µ given by

\[ \mu_n = \frac{n}{\sum_{i=1}^n D_i}, \tag{1.12} \]

and the optimal order quantity is given by

\[ x_{n+1} = \frac{\sum_{i=1}^n D_i}{n}\,\ln\!\left(\frac{s}{c}\right). \tag{1.13} \]

When the demands are iid U(0, a), the sufficient statistic is

\[ T(D^{(n)}) = \max\{D_1, D_2, \cdots, D_n\}, \]

with the MLE of a given by

\[ a_n = \max\{D_1, D_2, \cdots, D_n\}. \]

The optimal order quantity is given by

\[ x_{n+1} = \max\{D_1, D_2, \cdots, D_n\}\left(\frac{s-c}{s}\right). \tag{1.14} \]

There is one problem with this approach: we do not know what to do in the first period, when no data is available.
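The plug-in MLE policy for the exponential case (Equation 1.13) is easy to simulate. The sketch below (with illustrative values µ, s, c not prescribed by the text) shows the order quantity approaching the full-information optimum of Equation 1.5 as data accumulates:

```python
import random
from math import log

random.seed(1)
mu, s, c = 1.0, 10.0, 5.0
x_star = (1 / mu) * log(s / c)     # full-information optimum (Equation 1.5)

demands = []
x_next = None
for day in range(1000):
    demands.append(random.expovariate(mu))
    # Equation 1.13: order the sample mean times ln(s/c) for the next day.
    x_next = (sum(demands) / len(demands)) * log(s / c)

print(x_next, x_star)  # x_next is near x_star once the sample is large
```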

1.2.2 Bayesian Updates.

For Bayesian analysis we assume that π_1 is the prior density of θ, the parameter that characterizes the density f(y|θ). We compute the unconditional density f_1(y) of D_1 by

\[ f_1(y) = \int_\theta f(y|\theta)\,\pi_1(\theta)\,d\theta. \]

Using the cdf F_1 corresponding to f_1 in place of F in Equation 1.3, we compute the optimal order quantity x_1 to be used in period 1, when we do not have any data at all. This solves the problem with the MLE approach, under which we do not know how to compute x_1. After observing D_1 = y we update the prior as follows:

\[ \pi_2(\theta \mid D_1 = y) = \frac{f(y|\theta)\,\pi_1(\theta)}{\int_\theta f(y|\theta)\,\pi_1(\theta)\,d\theta}. \]

This process can then be repeated every period. See Scarf [96] and Azoury [7] for Bayesian analysis of a more general model.

Note that this is indeed an optimal policy for the sequential newsvendor problem, since the ordering decision in period n has no effect on the information available in period n + 1. This is a consequence of the assumption that the demands are completely observable.

For Bayesian analysis of Exp(µ) demands, suppose the prior of µ is Gamma(α, β), with density β^α µ^{α−1} e^{−βµ}/Γ(α). This is the conjugate prior for the exponential distribution. Then the unconditional complementary cdf of D is given by

\[ \bar F(x) = P(D > x) = \int_0^\infty e^{-\mu x}\,\frac{\beta^\alpha \mu^{\alpha-1} e^{-\mu\beta}}{\Gamma(\alpha)}\,d\mu = \left(\frac{\beta}{x+\beta}\right)^{\!\alpha}. \]

Hence the optimal order quantity is given by

\[ x = F^{-1}((s-c)/s) = \beta\left((s/c)^{1/\alpha} - 1\right). \tag{1.15} \]

The posterior of µ after observing D_1, D_2, \cdots, D_n is Gamma(α_n, β_n), where

\[ \alpha_n = \alpha + n, \qquad \beta_n = \beta + \sum_{i=1}^n D_i. \]


We use this to determine the order quantity on day n + 1 by applying Equation 1.15 with the updated values α = α_n and β = β_n. Note that the posterior mean of µ is (α + n)/(β + \sum_{i=1}^n D_i), which approaches the MLE µ_n as n → ∞.
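A sketch of this conjugate updating scheme (the prior parameters, costs, and true rate below are illustrative, not from the text): each day the order comes from Equation 1.15 evaluated at the current posterior parameters, which are then updated with the observed demand.

```python
import random

random.seed(2)
s, c = 10.0, 5.0
alpha, beta = 2.0, 2.0       # Gamma(alpha, beta) prior on mu (illustrative)
mu_true = 1.0

orders = []
for n in range(50):
    # Equation 1.15 with the current posterior parameters.
    orders.append(beta * ((s / c) ** (1 / alpha) - 1))
    d = random.expovariate(mu_true)      # observe today's demand
    alpha, beta = alpha + 1, beta + d    # conjugate update

print(orders[0], orders[-1])  # drifts toward ln(s/c)/mu_true as data accrue
```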

For Bayesian analysis of U(0, a) demands, suppose the prior of a is Pareto(r, k), with complementary cdf (r/a)^k for a ≥ r, where r > 0 and k > 0. This is the conjugate prior for the parameter a of the U(0, a) distribution. The unconditional density of D is given by

\[ f_D(y) = \begin{cases} \dfrac{k}{(k+1)r} & \text{if } 0 \le y < r, \\[1.5ex] \dfrac{k}{k+1}\cdot\dfrac{r^k}{y^{k+1}} & \text{if } y \ge r. \end{cases} \]

The unconditional expectation is \frac{kr}{2(k-1)} (for k > 1). We use this unconditional density to compute the optimal order quantity using Equation 1.3. Assuming (s − c)/s ≥ k/(k + 1), so that the critical fractile falls in the tail y ≥ r, this is given by

\[ x_1^* = r\left(\frac{s}{(k+1)c}\right)^{1/k}. \]

The posterior of a after observing D_1, D_2, \cdots, D_n is Pareto(r_n, k_n), where

\[ r_n = \max\{r, D_1, D_2, \cdots, D_n\}, \qquad k_n = k + n. \]

Note that the posterior mean of a approaches the MLE of a as n → ∞ as long as r is sufficiently small. Using this we compute

\[ x_{n+1}^* = r_n\left(\frac{s}{(k_n+1)c}\right)^{1/k_n}. \]
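Since the formula for x₁* involves a case analysis, a quick numerical sanity check helps. The sketch below (illustrative values r, k, s, c, chosen so that (s−c)/s ≥ k/(k+1); not from the text) verifies that the order quantity hits the critical fractile (s−c)/s of the unconditional cdf:

```python
# Illustrative values (not from the text); (s-c)/s = 0.7 >= k/(k+1) = 2/3.
r, k, s, c = 1.0, 2.0, 10.0, 3.0

def F(x):
    # cdf of the unconditional demand density f_D above
    if x < r:
        return k * x / ((k + 1) * r)
    return 1.0 - r**k / ((k + 1) * x**k)

x1 = r * (s / ((k + 1) * c)) ** (1 / k)
print(F(x1), (s - c) / s)  # both equal the critical fractile 0.7
```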

1.2.3 Operational Statistics.

This material is based on Liyanage and Shanthikumar [73] and Chu et al [31]. The MLE method described above is a two-step procedure: we first use the data to estimate the parameters of the demand distribution, and then use the estimated parameters to compute the optimal order quantity. Under the methodology of Operational Statistics, we use the data D = [D_1, D_2, \cdots, D_n] to directly estimate the optimal order quantity g(D). We assume that Z_i, i ≥ 1, are iid with a known density f and that D_i = θZ_i, where the scale parameter θ is unknown. Thus the likelihood function of D, for a given θ, is

\[ f_D(D|\theta) = \frac{1}{\theta^n}\prod_{i=1}^n f(D_i/\theta). \]

The goal is to find an order quantity (called the operational statistic) g(D) such that the expected profit

\[ E(\phi(g(D))|\theta) = \int_{x \ge 0} \phi(g(x)|\theta)\,f_D(x|\theta)\,dx \]

is maximized for all θ. There may not exist such a g unless we restrict the set of admissible g's in some way. Since we have assumed that the scale parameter is unknown, it makes sense to restrict the set of operational statistics g to the set H of degree-one homogeneous functions; that is, we insist that g(αx) = αg(x) hold for


any scalar α ≥ 0. It is then possible to show that there is an optimal operational statistic within the class H. The main theorem is given below. Let h∗ be the function defined by

\[ h^*(D) = \operatorname*{argmax}_{y \ge 0} \int_0^\infty \frac{1}{\theta}\,\phi(y|\theta)\,\frac{1}{\theta}\,f_D(D|\theta)\,d\theta. \tag{1.16} \]

Theorem 1.1. h∗(D) defined by Equation 1.16 belongs to H, and it maximizes the expected profit E(φ(g(D))|θ) over all g ∈ H, for all θ ≥ 0.

Note that the integral in Equation 1.16 can be interpreted as the expectation of φ/θ under the Bayesian prior 1/θ for θ. This prior cannot be normalized; however, it is well known as the non-informative or Jeffreys prior for a scale parameter. It can be used as long as the integral is well defined. We choose to maximize the expectation of φ/θ, since this quantity is intuitively normalized for the scale parameter θ.

We illustrate with the example of exponential demands. Here we have f(z) = e^{−z} for z ≥ 0. Then D_i = θZ_i is an Exp(1/θ) random variable with unknown scale parameter θ. The integral in Equation 1.16 is given by

\[ \int_0^\infty \left(s\theta(1 - e^{-y/\theta}) - cy\right)\frac{1}{\theta^{n+2}}\,e^{-\sum_{i=1}^n D_i/\theta}\,d\theta. \]

This reduces to

\[ s(n-1)!\left(\frac{1}{\left(\sum_{i=1}^n D_i\right)^n} - \frac{1}{\left(y+\sum_{i=1}^n D_i\right)^n}\right) - cy\,\frac{n!}{\left(\sum_{i=1}^n D_i\right)^{n+1}}. \]

This is maximized at

\[ y = h^*(D) = n\left(\left(\frac{s}{c}\right)^{1/(n+1)} - 1\right)\frac{\sum_{i=1}^n D_i}{n}. \tag{1.17} \]

Compare this to the MLE order size given in Equation 1.13. Both are proportional to the sample average, but the constants are different. Since h∗(D) is optimal within H, and the MLE order quantity belongs to H, it is clear that the expected profit using the operational statistic beats that using the MLE.

Assignment: Give the details of the derivation of Equation 1.17.
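The claim that the operational-statistics rule beats the MLE rule can be seen in a paired Monte Carlo comparison. The sketch below (illustrative values θ, s, c, n; not from the text) applies both linear rules to the same simulated samples and compares average expected profits:

```python
import random
from math import exp, log

random.seed(3)
theta, s, c, n = 1.0, 10.0, 5.0, 5
k_mle = log(s / c)                    # Equation 1.13: order = k_mle * sample mean
k_ops = (s / c) ** (1 / (n + 1)) - 1  # Equation 1.17: order = k_ops * sample sum

def profit(x):
    # phi(x | theta) = s*theta*(1 - e^{-x/theta}) - c*x for Exp demand
    return s * theta * (1 - exp(-x / theta)) - c * x

reps, p_mle, p_ops = 100_000, 0.0, 0.0
for _ in range(reps):
    total = sum(random.expovariate(1 / theta) for _ in range(n))
    p_mle += profit(k_mle * total / n)
    p_ops += profit(k_ops * total)

print(p_mle / reps, p_ops / reps)  # the operational-statistics average is higher
```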

As another example, suppose Z ∼ U(0, 1). Then D_i = θZ_i are U(0, θ) with unknown scale parameter θ. Let D_{[n]} = \max\{D_1, D_2, \cdots, D_n\}. The integral in Equation 1.16 is given by

\[ \int_{D_{[n]}}^\infty \left(s\left(y - \frac{y^2}{2\theta}\right) - cy\right)\frac{1}{\theta^{n+2}}\,d\theta, \]

since the likelihood f_D(D|θ) vanishes for θ < D_{[n]}. The integral can be computed as a function of y by considering two cases: y ≤ D_{[n]}, or y ≥ D_{[n]}. The y that maximizes the above integral is given by (see Chu et al [31])

\[ y = h^*(D) = \begin{cases} \dfrac{n+2}{n+1}\left(1 - \dfrac{c}{s}\right)D_{[n]} & \text{if } n \ge s/c - 2, \\[1.5ex] \left(\dfrac{s}{(n+2)c}\right)^{1/(n+1)} D_{[n]} & \text{if } n \le s/c - 2. \end{cases} \tag{1.18} \]

Assignment: Give the details of the derivation of Equation 1.18.

Assignment: Present a systematic method of deriving operational statistics via Bayesian analysis, following Chu et al [31].


1.3 Unknown Non-parametric F , Observable Demand

In this section we assume that the demand distribution is non-parametric and unknown. The demands are observable. We discuss several approaches that researchers have explored to handle this case.

1.3.1 Maximin Criterion

Scarf et al [97] treat this case in an interesting way. They assume that the mean and the variance of the demand distribution are known. They then find the order quantity that maximizes the minimum expected profit over all demand distributions with the given mean and variance. This is called the maximin order quantity. Gallego and Moon [37] give a simpler proof of their result. We summarize the result below.

Let G be the set of distributions that have mean µ and variance σ². They first show that

\[ E((D-x)^+) \le \frac{1}{2}\left((\sigma^2 + (x-\mu)^2)^{1/2} - (x-\mu)\right), \tag{1.19} \]

whenever the distribution of D belongs to G. Then they show that for every x there is a G_x ∈ G that achieves the above upper bound. This G_x is a two-point distribution that puts weights α_i on points y_i (i = 1, 2). They give explicit expressions for α_i and y_i in terms of x, µ, and σ². Thus, using Equation 1.1, we see that the minimum expected profit for a given order quantity x is given by

\[ s\mu - cx - \frac{s}{2}\left((\sigma^2 + (x-\mu)^2)^{1/2} - (x-\mu)\right). \]

Now we can choose an x that maximizes this minimum expected profit. This maximin x is given by

\[ x^* = \mu + \frac{\sigma}{2}\left(\sqrt{\frac{s-c}{c}} - \sqrt{\frac{c}{s-c}}\right). \]

Note that x∗ ≥ µ if (s − c)/c ≥ 1, that is, if the markup is more than 100%! Note also that the maximin order quantity and the worst-case expected profit may be negative! In that case it is clearly better to order zero items. The authors prove that it is optimal to order the maximin quantity if (s − c)/c ≥ (σ/µ)², and zero otherwise. They consider several extensions, such as multiple items and random yields.

Assignment: Present the proof of this result from Gallego and Moon [37].
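Scarf's rule can be checked by maximizing the worst-case profit expression directly. The sketch below (illustrative values µ, σ, s, c; not from the text) compares the closed form with a grid maximization:

```python
from math import sqrt

# Illustrative values (not from the text); here (s-c)/c = 2 >= (sigma/mu)^2.
mu, sigma, s, c = 100.0, 30.0, 12.0, 4.0

def worst_case_profit(x):
    # s*mu - c*x - (s/2)*(sqrt(sigma^2 + (x-mu)^2) - (x - mu))
    return s * mu - c * x - (s / 2) * (sqrt(sigma**2 + (x - mu)**2) - (x - mu))

m = (s - c) / c
x_star = mu + (sigma / 2) * (sqrt(m) - sqrt(1 / m))   # Scarf's maximin quantity
grid = [mu + i / 10 for i in range(-500, 501)]
x_num = max(grid, key=worst_case_profit)
print(x_star, x_num)  # the grid maximizer agrees with the closed form
```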

1.3.2 Minimax Criterion

The maximin criterion discussed above tends to be very conservative, since it is designed to do well in the worst case. An alternative is to minimize the regret of behaving non-optimally. We define this first. Let G be the distribution of the demand, and let φ_G(x) be the expected net profit if the order quantity is x. (We add the subscript G to emphasize the dependence on G.) Let x∗_G be the optimal order quantity that maximizes φ_G(·), and let φ∗_G = φ_G(x∗_G). Then the regret of using the order quantity x is defined as

\[ r_G(x) = \phi_G^* - \phi_G(x). \]


This quantity is positive, and a bigger regret implies a worse decision x. So we want to minimize the regret. However, since we do not know the distribution G, we aim to find the order quantity x^* that minimizes the maximum value of r_G(x) over all distributions G belonging to a given class of distributions G. The resulting value of x is called the minimax regret order quantity. Perakis and Roels [86] have analyzed this problem for several possible sets of distributions. We briefly summarize some of their results below in terms of the parameter β = c/s.

Case 1: G is the set of all distributions with support on the interval [A, B]. Then the minimax regret order quantity is given by

x^* = βA + (1 − β)B.

The minimax regret is

r^* = β(1 − β)(B − A).

Case 2: G is the set of all distributions with a given mean µ. Then the minimax regret order quantity is given by

x^* = µ(1 − β) if β ≥ 1/2, and x^* = µ/(4β) if β ≤ 1/2.

The minimax regret is

r^* = µβ(1 − β) if β ≥ 1/2, and r^* = µ/4 if β ≤ 1/2.

Case 3: G is the set of all distributions with a given mean µ and a given variance σ^2. In this case there is no explicit expression for the minimax regret order quantity. It is given by the solution to the nonlinear equation given below. Let

a(x) = max_y { (µ/y − β)(y − x) : max(µ, x) ≤ y ≤ (σ^2 + µ^2)/µ },

b(x) = max_y { (σ^2/(σ^2 + (y − µ)^2) − β)(y − x) : max(x, (σ^2 + µ^2)/µ) ≤ y ≤ x + √(σ^2 + (x − µ)^2) },

c(x) = max_y { ((y − µ)^2/(σ^2 + (y − µ)^2) − β)(y − x) : max(0, x − √(σ^2 + (x − µ)^2)) ≤ y ≤ min(x, µ) }.

Then the minimax regret order quantity is the unique x that satisfies

max(a(x), b(x)) = c(x).

Clearly this has to be done numerically.

Assignment: Present proofs of the above cases.

The maximin and minimax approaches have fallen out of favor and are included here mainly as elegant mathematical models. Typically you would know the mean and variance as estimates from past data. The question then arises: why not just use the whole data set? This leads us to the next subsection.


1.3.3 Empirical Distribution.

The nonparametric estimator of F, based on the observations D(n) = [D1, D2, · · · , Dn], is given by

Fn(y) = (1/n) Σ_{i=1}^{n} 1_{Di ≤ y}, y ≥ 0.

It is known that Fn is a consistent estimator of F. One can compute the optimum order quantity xn using Fn in place of F in Equation 1.3. In fact, one need not compute Fn at all, but can directly compute the order quantity. We need the following notation: let D(1), D(2), · · · , D(n) be the ascending order statistics of D1, D2, · · · , Dn. We see that, if r = n(s − c)/s is an integer, we can choose any

xn ∈ [D(r), D(r+1)), (1.20)

otherwise we choose r = dn(s− c)/se and choose

xn = D(r).

One can show that xn converges to x∗ in probability as n→∞.
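The empirical-quantile rule above can be sketched as follows; when r = n(s − c)/s is an integer the code simply returns D(r), one valid choice from the interval in Equation 1.20:

```python
import math

def empirical_order_quantity(demands, s, c):
    """Order quantity from the empirical distribution: the r-th ascending
    order statistic (1-indexed) with r = ceil(n(s-c)/s)."""
    d = sorted(demands)
    n = len(d)
    r = max(1, math.ceil(n * (s - c) / s))
    return d[r - 1]
```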

1.3.4 Operational Statistics.

Liyanage and Shanthikumar [73] provide an operational statistics analysis in the case of exp(θ) demands. They consider operational statistics of the form

D(r−1) + a(D(r) − D(r−1)), r ≥ 1, a ≥ 0.

They find that the optimal parameters are:

r^* = argmin { 2√((s/c) · (n − r + 2)/(n + 1)) + Σ_{k=1}^{r−1} 1/(n − k + 1) : 1 ≤ r ≤ (n + 1)(s − c)/s },

and

a^* = (n − r^* + 1)(√((s/c) · (n − r^* + 2)/(n + 1)) − 1).

It is easy to see that this will produce a larger expected profit than the one produced by the empirical distribution analysis.

1.4 Unknown Parametric F , Censored Demand

So far we have assumed that the demand data is completely known. However, in many situations we onlyhave sales data, and not the demand data. To make this precise, we introduce the following notation.


Let Si be the sales on day i. We have Si = min(xi, Di), where xi is the amount ordered at the beginning of the ith day, and Di is the demand on day i. Let Yi = 1 if Si < xi and 0 otherwise. Suppose we can observe Si and Yi, but not Di. Thus we have right-censored observations of Di. This makes the estimation of F more involved.

1.4.1 Maximum Likelihood

We show below how we can approach the problem for the simple case of an exponential demand distribution with unknown parameter µ using MLEs. We assume that the optimal order quantity xn is determined by Equation 1.5, where we use the appropriate estimator µn of µ based on the observations (S1, Y1), (S2, Y2), · · · , (Sn, Yn). We shall assume that µ1 is an arbitrary constant u. We shall now derive a recursive method of computing µn. Define

mn = Σ_{i=1}^{n} Yi,  Zn = Σ_{i=1}^{n} Si.

The likelihood of the data (S1, Y1), (S2, Y2), · · · , (Sn, Yn) is given by

Ln = µ^{mn} exp(−µ Zn).

The µ that maximizes this likelihood is given by

µn = mn/Zn, n ≥ 1.

The order quantity in period n+ 1 is given by

xn+1 = (1/µn) ln(s/c), n ≥ 1.
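The recursion above can be sketched as follows; the initial estimate mu1 and the (S_i, Y_i) pairs in the usage are hypothetical:

```python
import math

def censored_exp_mle_orders(s, c, sales_censor_pairs, mu1=1.0):
    """Recompute the MLE mu_n = m_n / Z_n from censored observations
    (S_i, Y_i) and the next order x_{n+1} = (1/mu_n) ln(s/c).
    mu1 is the arbitrary initial estimate used before any data arrive."""
    m, Z = 0, 0.0
    mu = mu1
    orders = [math.log(s / c) / mu]   # x_1 from the initial guess
    for S, Y in sales_censor_pairs:
        m += Y                         # count uncensored days
        Z += S                         # accumulate total sales
        if m > 0:                      # MLE undefined until one uncensored obs
            mu = m / Z
        orders.append(math.log(s / c) / mu)
    return orders
```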

Assignment: Do a similar analysis in the Uniform demand case.

1.4.2 Bayesian Updates

The Bayesian analysis is a bit more involved in the presence of censored demands. We show the details here for the case of iid exponential demands. Our analysis is a special case of the analysis in Ding et al [33]. (Also see Harpaz et al [50].) Suppose D1 is Exp(µ), and µ has prior distribution π0 ∼ Gamma(α, β). Then the unconditional distribution of D1 is given by

F0(x) = P(D1 > x) = ∫_0^∞ e^{−µx} (1/Γ(α)) β^α µ^{α−1} e^{−µβ} dµ = (β/(x + β))^α. (1.21)

Hence we start the system with order quantity

x1 = F0^{−1}((s − c)/s) = β((s/c)^{1/α} − 1).


Then we observe (S1, Y1) on day 1. Based on this we compute the posterior distribution π1 of µ. Direct calculations show that

π1 ∼ Gamma(α + 1, β + S1) if Y1 = 1, and π1 ∼ Gamma(α, β + x1) if Y1 = 0.

Thus the posterior distribution after the first observation is again a Gamma distribution, but its parameters depend upon whether the first observation is censored or not. The same procedure can be repeated on each day. Thus suppose πn is Gamma(αn, βn). Then compute

xn+1 = Fn^{−1}((s − c)/s) = βn((s/c)^{1/αn} − 1). (1.22)

Then order up to xn+1 on day n + 1, and observe (Sn+1, Yn+1). Then the posterior distribution πn+1 is Gamma(αn+1, βn+1), where

(αn+1, βn+1) = (αn + 1, βn + Sn+1) if Yn+1 = 1, and (αn+1, βn+1) = (αn, βn + xn+1) if Yn+1 = 0. (1.23)
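The conjugate update in Equation 1.23 and the myopic order in Equation 1.22 can be sketched as:

```python
def bayes_update(alpha, beta, x, demand):
    """One day of the update (Equation 1.23): order x, observe sales
    S = min(x, D) and the censoring flag Y, then update (alpha, beta)."""
    S = min(x, demand)
    Y = 1 if S < x else 0
    if Y == 1:
        return alpha + 1, beta + S
    return alpha, beta + x

def bayes_order(alpha, beta, s, c):
    """Myopic order quantity of Equation 1.22."""
    return beta * ((s / c) ** (1.0 / alpha) - 1.0)
```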

1.4.3 Optimal Bayes’ Policy

Unlike in the case of observable demands, the above policy may not be optimal for a sequential newsvendor problem. For example, suppose we have to solve the newsvendor problem N times. Then it might make sense to order more than xn on day n during the early stages of the problem (that is, for small n), so that we learn the true demand with higher probability. This will yield more accurate information that will provide better profits during the later stages. This problem was considered by Ding et al [33]. We present a summary of their analysis assuming iid exp(µ) demands, with Gamma(α1, β1) as the initial prior π1 for µ. The state on day n is then given by (αn, βn). The decision on day n is xn. The transition probabilities are given by Equation 1.23. The expected reward on any day, if action x is chosen in state (α, β), is given by

R((α, β), x) = sE(min(x,D))− cx,

where the expectation is under the assumption that the distribution of D is given by Equation 1.21. Let vn(α, β) be the maximum total expected profit over days n, n + 1, · · · , N starting with state (α, β). Then we have the following dynamic programming recursion:

vN+1(α, β) = 0,

and for n = N,N − 1, · · · , 0,

vn(α, β) = sup_{x ≥ 0} { R((α, β), x) + ∫_0^x vn+1(α + 1, β + y) (α β^α/(β + y)^{α+1}) dy + vn+1(α, β + x) (β/(β + x))^α }.

Let x^b_n(α, β) be the value of x that maximizes the right hand side. Then x^b_n is the optimal order quantity to use on day n. Ding et al [33] prove that x^b_n(α, β) ≥ xn(α, β) (where xn(α, β) = xn is from Equation 1.22 using αn = α and βn = β), for all n. This follows our intuition that it may be worth choosing an order size that is higher than the myopically optimal one in order to learn more about the actual demand distribution. Unfortunately, it is not easy to evaluate x^b_n numerically, even for the exponential case presented here.


1.5 Unknown Non-parametric F , Censored Demands

Finally we consider the case of an unknown, non-parametric demand distribution and censored demands. There are several methods of handling such a case. Here we discuss a few.

1.5.1 Kaplan-Meier Estimator

Let S(1) ≤ S(2) ≤ · · · ≤ S(n) be the ordered values of the sales data. If S(i) = Sj, define Y(i) = Yj. Then, assuming the demands are iid with cdf F, the maximum likelihood estimator of P(D ≥ t), for t ≤ S(n), is given by

Π_{i: S(i) ≤ t} (1 − Y(i)/(n − i + 1)).

This estimator was first derived and studied by Kaplan and Meier [57]. No estimator is available for t > S(n). The critical assumption in this derivation is that the xi's are known and independent of the S's. This assumption is not valid in our setting, since xi depends on the observations up to i − 1.
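The product-limit estimator above can be sketched as follows (here the flag Y_i = 1 means the demand was observed exactly, i.e. S_i < x_i):

```python
def kaplan_meier_survival(sales, uncensored_flags, t):
    """Estimate P(D >= t) from sales S_i and flags Y_i.
    Valid only for t <= max(sales), as noted in the text."""
    pairs = sorted(zip(sales, uncensored_flags))  # order statistics S_(i), Y_(i)
    n = len(pairs)
    surv = 1.0
    for i, (S, Y) in enumerate(pairs, start=1):
        if S <= t:
            surv *= 1.0 - Y / (n - i + 1)
    return surv
```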

Huh et al [54] study the quantile-based ordering policy based on the empirical distribution estimated using the Kaplan-Meier estimator. They show that the policy is consistent and robust, assuming that the demands are discrete, even if xi depends on the data observed up to i − 1.

Assignment: Present the proof of Theorem 2 in Huh et al [54].

1.5.2 Stochastic Approximation

This material is based on the results in Burnetas and Smith [25]. Define Yn = 1 if Sn < xn, and 0 otherwise. Also define r = (s − c)/s. Suppose x1 is chosen arbitrarily and subsequent order quantities are chosen by the recursion:

xn+1 = xn − xn(Yn − r)/n, n ≥ 1. (1.24)

Thus if Yn = 1, xn+1 = xn(1 − (1 − r)/n) < xn, and if Yn = 0, xn+1 = xn(1 + r/n) > xn. That is, if we are over-stocked in period n, we reduce the next period's order by a factor (1 − r)/n; and if we are under-stocked in period n, we increase the subsequent order by a factor r/n. Thus we learn from our mistakes, but the changes in the order size become smaller as n increases. Burnetas and Smith [25] show that, with probability 1, xn → x^*, the optimal order quantity, when the demand distributions are continuous. We briefly explain the theory behind this.

Equation 1.24 can be thought of as a stochastic approximation recursion, which is given by

xn+1 = xn − anH(xn, Yn), n ≥ 1. (1.25)


Suppose the sequence an, n ≥ 0, called the step-size sequence, satisfies

Σ_{n=1}^{∞} an = ∞,  Σ_{n=1}^{∞} an^2 < ∞. (1.26)

Let

h(x) = E(H(xn, Yn) | xn = x),  σ^2(x) = E(H^2(xn, Yn) | xn = x).

General results from the theory of stochastic approximation state that if Equation 1.26 holds and there exists a finite M such that

σ^2(x) ≤ M(1 + x^2),

then the iterates of Equation 1.25 satisfy

lim_{n→∞} h(xn) = 0

with probability 1. (See Kushner and Ying [63].)

Assignment: Present a proof of the above results using Kushner and Ying [63].

In our case we have

H(xn, Yn) = xn(Yn − r),  an = 1/n.

Thus Equation 1.26 is satisfied. Also,

h(x) = x E(Yn − r | xn = x) = x(P(Dn < x) − r) = x(F(x) − r),

σ^2(x) = x^2 E((Yn − r)^2 | xn = x) = x^2 (F(x)(1 − r)^2 + (1 − F(x)) r^2).

Thus all conditions for the limit are satisfied. Thus, xn converges to x that satisfies

h(x) = x(F (x)− r) = 0, (1.27)

which has two solutions, x = 0 and x = F^{−1}(r). Burnetas and Smith [25] show that xn converges to a positive constant. From the general theory of stochastic approximation we know that x^* exists and satisfies Equation 1.27. Hence it must satisfy F(x^*) = r.

We could have considered the recursion in Equation 1.25 with H(xn, Yn) = Yn − r and an = 1/n. Then it is easy to conclude that xn converges to F^{−1}(r) with probability 1. However, this recursion can let xn go negative. This problem is avoided if we use the recursion in Equation 1.24.
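A quick simulation of the recursion in Equation 1.24 illustrates the convergence. The demand model here is a hypothetical choice: iid Exp(1) demands with s = 2, c = 1, so r = 1/2 and the optimal quantity is F^{−1}(1/2) = ln 2 ≈ 0.69:

```python
import random

def burnetas_smith(s, c, demand_sampler, n_periods, x1=1.0, seed=0):
    """Simulate x_{n+1} = x_n - x_n (Y_n - r)/n with r = (s - c)/s.
    demand_sampler() draws one demand."""
    random.seed(seed)
    r = (s - c) / s
    x = x1
    for n in range(1, n_periods + 1):
        D = demand_sampler()
        Y = 1 if D < x else 0     # over-stocked indicator Y_n
        x = x - x * (Y - r) / n
    return x

# hypothetical run: Exp(1) demand, so the target is ln 2
x_final = burnetas_smith(2.0, 1.0, lambda: random.expovariate(1.0), 20000)
```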

1.5.3 CAVE Algorithm

This material is based on the results from Godfrey and Powell [45]. They construct a continuous piecewise linear concave approximation φn(x) to φ(x) of Equation 1.2 based on n observations in an empirical way, roughly as follows. The initial approximation is

φ0(x) = −cx, x ≥ 0.


On day n ≥ 1 we choose the order quantity x^*_n that maximizes φn−1(·). Thus x^*_1 = 0. Then we observe Sn = min(Dn, x^*_n). We construct the new approximation φn(·) by introducing a new break point at Sn. If Dn ≥ x^*_n, we set the left slope and the right slope at Sn equal to s − c. If Dn < x^*_n, we set the left slope at Sn to be s − c and the right slope to be −c. (More details needed here.) We introduce at most two other break points, one to the left and one to the right of the new break point, to maintain continuity and concavity of the new approximation φn(·). See Godfrey and Powell [45] for more details. We repeat this procedure on each day.

The authors show that, regardless of the actual distribution of D,

(1/T) Σ_{n=1}^{T} φ(x^*_n) → φ(x^*),

the optimal expected profit. They do not derive any results regarding the rate of convergence.

Assignment: Code the algorithm and simulate it for iid exponential demands.

1.5.4 AIM Algorithm

This material is based on the results in Huh and Rusmevichientong [53]. They minimize the long run expected cost. We rewrite their algorithm to maximize the long run expected profit, to be consistent with our common theme. They develop the AIM (Adaptive Inventory Management) algorithm that determines the order quantity xn+1 in period n + 1 based on the sales data Sm = min(xm, Dm), (1 ≤ m ≤ n). As before, we use Yn = 1 if Sn < xn and 0 otherwise. No information about the demand distribution is assumed. We begin by assuming that we know that the optimal order quantity x^* of Equation 1.3 is bounded above by x̄. This can be relaxed.

The AIM algorithm computes the order quantity xn, n ≥ 1, recursively. We use the following notation:

bang(x, a, b) = a if x < a; x if a ≤ x < b; b if b ≤ x,

εn = x̄/(max(s − c, c) √n),

and

Hn(x) = −c if Yn = 1; s − c if Yn = 0.

We start with x1, an arbitrary quantity bounded above by x̄. Then we recursively compute

xn+1 = bang(xn + εn Hn(xn), 0, x̄).

The authors prove that

φ(x^*) − (1/T) Σ_{n=1}^{T} φ(xn) ≤ 2 x̄ max(s − c, c)/√T.


This is an explicit upper bound on the error, and it goes to zero as 1/√T. They show that this bound is tight, that is, there exists a demand distribution for which the bound is attained.

Note that this is a version of a stochastic approximation algorithm. It seems to converge more slowly than the one given by Equation 1.24.
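The AIM recursion can be sketched as follows; the Uniform(0, 1) demand, the bound x̄ = 1, and the prices in the usage line are hypothetical choices:

```python
import math
import random

def aim(s, c, xbar, demand_sampler, T, seed=0):
    """Sketch of the AIM recursion x_{n+1} = bang(x_n + eps_n H_n(x_n), 0, xbar).
    xbar is the assumed upper bound on the optimal order quantity."""
    random.seed(seed)
    x = xbar / 2.0
    path = []
    for n in range(1, T + 1):
        D = demand_sampler()
        Y = 1 if D < x else 0                  # Y_n = 1 iff sales fell short of x_n
        eps = xbar / (max(s - c, c) * math.sqrt(n))
        H = -c if Y == 1 else (s - c)
        x = min(max(x + eps * H, 0.0), xbar)   # bang(., 0, xbar)
        path.append(x)
    return path

# hypothetical run: Uniform(0,1) demand with s=2, c=1, so x* = F^{-1}(1/2) = 0.5
x_path = aim(2.0, 1.0, 1.0, random.random, 3000)
```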

Assignment: Present the proof of this theorem.

1.6 Data Driven Approach

Now we consider an entirely data-driven approach that makes no stochastic assumptions. It assumes that we have data about demands, or sales, and possibly some covariates that may explain the demand (like price, weather, holidays, etc.). The aim is to derive a functional of the data that will optimize the performance.

1.6.1 Data Driven Approach: Linear Programming

The following material is based on Bertsimas and Thiele [15]. Fix an α ∈ [0, 1] and consider the optimization problem of deciding the optimal order quantity x that will maximize gα(x) given in Equation 1.10. If the distribution of demands is known, the solution is as given in Equation 1.11. However, we do not know the distribution of the demands; instead, we are given a sample of n observed demands D(n) = [D1, D2, · · · , Dn]. Let D(i) be the ith ascending order statistic of the demand vector. Let nα = ⌈nα + (1 − α)⌉. Thus nα increases from 1 to n as α increases from 0 to 1. Since φ(x, D) is increasing in D, the estimator of E(φ(x, D) | φ(x, D) ≤ φα(x)) is given by

(1/nα) Σ_{i=1}^{nα} φ(x, D(i)) = (s/nα) Σ_{i=1}^{nα} min(x, D(i)) − cx.

Note that the higher the α, the more of the data that we use to compute this estimate. Thus we want to solvethe following optimization problem:

Maximize (s/nα) Σ_{i=1}^{nα} min(x, D(i)) − cx

subject to x ≥ 0.

This is a piecewise linear concave function of x. Hence the problem can be reformulated as a linear program. The optimal solution to the above problem is shown to be

x^*(α) = D(r),

where

r = ⌈nα(s − c)/s⌉.

In practice we use this order quantity in period n+ 1.
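A sketch of this rule, applying the order-statistic formula to the nα smallest demands; the example data are hypothetical:

```python
import math

def data_driven_order(demands, s, c, alpha):
    """Order quantity x*(alpha) = D_(r) with r = ceil(n_alpha (s-c)/s),
    where n_alpha = ceil(n*alpha + (1-alpha)) smallest demands are used."""
    d = sorted(demands)
    n = len(d)
    n_alpha = math.ceil(n * alpha + (1 - alpha))
    r = max(1, math.ceil(n_alpha * (s - c) / s))
    return d[r - 1]
```

With α = 1 this reduces to the empirical-distribution order quantity of Subsection 1.3.3; smaller α makes the order more conservative.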


Assignment: Present a proof of this result using Bertsimas and Thiele [15].

Levi et al [68] consider a similar formulation using the minimum cost approach. They call it the sample average approximation, or SAA for short. Thus they consider the problem of minimizing

c(y) = (1/n) Σ_{i=1}^{n} [h(y − Di)^+ + b(Di − y)^+].

One can show that an optimal solution is given by

y = inf { y ≥ 0 : Σ_{i=1}^{n} 1_{Di ≤ y} ≥ n b/(b + h) }. (1.28)

This is just the optimal order quantity using the empirical distribution obtained from [D1, D2, · · · , Dn]. They also prove error bounds comparing the expected cost of the order quantity derived from the data-driven approach to the true optimal expected cost. We state their result below. Let ε > 0 be a given accuracy level and δ ∈ (0, 1] be a confidence level. Suppose

n ≥ (9/(2ε^2)) ((b + h)/min(b, h))^2 ln(2/δ).

Then, assuming that y of Equation 1.28 is computed using n iid random variables,

c(y) ≤ (1 + ε)c(y∗)

with probability at least 1− δ.

Assignment: Present a proof of this result.

Sachs and Minner [92] consider the same problem when we have the sales data, that is, censored demanddata. The model is too detailed to include here.

1.6.2 Demands with Covariates: Linear Regression

The material in this section is based on Beutel and Minner [19]. They assume that the demand Di in period i depends on an explanatory variable Xi (say price, or temperature) in period i as follows:

Di = β0 + β1Xi + εi,

where εi, i ≥ 1, are uncorrelated errors. Suppose the observations (Di, Xi), i = 1, · · · , n, are available. Then the ordinary least squares estimates of the regression coefficients are given by

β̂1 = Σ_{i=1}^{n} (Xi − X̄)(Di − D̄) / Σ_{i=1}^{n} (Xi − X̄)^2,

and

β̂0 = D̄ − β̂1 X̄.

Page 28: Data Driven Decision Models - Vidyadhar Kulkarnivkulkarn.web.unc.edu/files/2018/10/alltextDDDM.pdf · standard deviation increase until the maximum mean 1.53 is reached at x= :69

28 CHAPTER 1. DATA-DRIVEN INVENTORY MANAGEMENT

Here

X̄ = (1/n) Σ_{i=1}^{n} Xi,  D̄ = (1/n) Σ_{i=1}^{n} Di.

Now suppose we have observed the value of the explanatory variable to be X for the next day. Then the predicted value of the demand for the next day is

d̂ = β̂0 + β̂1 X.

If we assume that the errors are iid normally distributed with mean 0 and variance σ^2, we see that the variance of D(X) is given by

a^2 = Var(D(X)) = σ̂^2 (1 + 1/n + (X − X̄)^2 / Σ_{i=1}^{n} (Xi − X̄)^2),

where

σ̂^2 = Σ_{i=1}^{n} (Di − β̂0 − β̂1 Xi)^2 / (n − 2).

Now we can assume that the demand D(X) is N(d̂, a^2), and hence compute the optimal order quantity using the formulas from Section 1.1. This analysis can easily be extended to the case where there is more than one explanatory variable.
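The two-step approach can be sketched as: fit OLS, then order the (s − c)/s quantile of N(d̂, a²). The sketch uses the standard-library NormalDist; the data in the test usage are hypothetical:

```python
from statistics import NormalDist

def regression_newsvendor(X, D, x_new, s, c):
    """Fit D_i = b0 + b1 X_i + eps by OLS, then return the (s-c)/s
    quantile of N(d_hat, a^2) with the prediction variance from the text."""
    n = len(X)
    xbar = sum(X) / n
    dbar = sum(D) / n
    sxx = sum((xi - xbar) ** 2 for xi in X)
    b1 = sum((xi - xbar) * (di - dbar) for xi, di in zip(X, D)) / sxx
    b0 = dbar - b1 * xbar
    sigma2 = sum((di - b0 - b1 * xi) ** 2 for xi, di in zip(X, D)) / (n - 2)
    d_hat = b0 + b1 * x_new
    a2 = sigma2 * (1 + 1 / n + (x_new - xbar) ** 2 / sxx)
    return NormalDist(mu=d_hat, sigma=a2 ** 0.5).inv_cdf((s - c) / s)
```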

Assignment: Present the results for the multivariate case.

1.6.3 Demands with Covariates: Linear Programming

The approach in the previous subsection first estimates the demand distribution using the regression model, and then uses it to compute the optimal order quantity. Beutel and Minner [19] next present an integrated approach where they directly compute the order quantity without estimating the demand distribution. Consider the multivariate case where we have available m explanatory variables X = (X1, X2, · · · , Xm) at the beginning of the day. We assume that the order quantity B is a linear function of these variables:

B = β0 + Σ_{j=1}^{m} βj Xj. (1.29)

The aim is to decide the optimal parameters β’s.

Let Di be the demand, and Xi = [X1i, X2i, · · · , Xmi] be the vector of explanatory variables on day i, i = 1, 2, · · · , n. Assume (Di, Xi), i = 1, 2, · · · , n, is available from the past data. Assume the cost model of Equation 1.8. Then the total cost of using the order size in Equation 1.29 is given by

Σ_{i=1}^{n} [h(Bi − Di)^+ + b(Di − Bi)^+].

Page 29: Data Driven Decision Models - Vidyadhar Kulkarnivkulkarn.web.unc.edu/files/2018/10/alltextDDDM.pdf · standard deviation increase until the maximum mean 1.53 is reached at x= :69

1.6. DATA DRIVEN APPROACH 29

We try to find the parameters β’s that minimize the above. Using

yi = (Bi −Di)+, si = (Di −Bi)+

we can write this as the following linear program

Min Σ_{i=1}^{n} [h yi + b si]

such that

Bi − Di ≤ yi, i = 1, · · · , n,

Di − Bi ≤ si, i = 1, · · · , n,

yi, si ≥ 0, i = 1, · · · , n.

Using

Bi = β0 + Σ_{j=1}^{m} βj Xji,

and rearranging, we get the following linear program:

CV: Min Σ_{i=1}^{n} [h yi + b si]

such that

β0 + Σ_{j=1}^{m} βj Xji − yi ≤ Di, i = 1, · · · , n,

β0 + Σ_{j=1}^{m} βj Xji + si ≥ Di, i = 1, · · · , n,

yi, si ≥ 0, i = 1, · · · , n.
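Solving CV requires an LP solver. As a deliberately crude stand-in, the sketch below scans a grid of (β0, β1) for the single-covariate case and evaluates the same piecewise-linear cost; a real implementation would pass CV to an LP solver:

```python
def pinball_cost(b0, b1, X, D, h, b):
    """Total cost of the rule B_i = b0 + b1*X_i on the sample."""
    return sum(h * max(b0 + b1 * xi - di, 0.0) + b * max(di - b0 - b1 * xi, 0.0)
               for xi, di in zip(X, D))

def fit_order_rule(X, D, h, b, b0_grid, b1_grid):
    """Grid search over (b0, b1); a stand-in for solving the LP CV."""
    _, b0, b1 = min((pinball_cost(b0, b1, X, D, h, b), b0, b1)
                    for b0 in b0_grid for b1 in b1_grid)
    return b0, b1
```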

The authors present other linear programming formulations corresponding to different objective functions.They also present numerical results using generated data as well as real data.

Assignment: Present the other formulations and the numerical results from [19].

1.6.4 Machine Learning Algorithms

In this section we shall present some of the results from Rudin and Vahn [91]. They start with the linear program CV of Subsection 1.6.3 and consider the regularized version of it as given below:

CVR: Min Σ_{i=1}^{n} [h yi + b si] + λ‖β‖


where λ ≥ 0 is a tuning parameter, and ‖β‖ is any norm of the vector β. The constraints of CV still hold. If

‖β‖ = ‖β‖2 = Σ_{j=0}^{m} βj^2

is the (squared) L2 norm, CVR is a quadratic program. If

‖β‖ = ‖β‖1 = Σ_{j=0}^{m} |βj|

is the L1 norm, CVR is a linear program. If

‖β‖ = ‖β‖0 = Σ_{j=0}^{m} 1_{βj ≠ 0}

is the L0 norm, CVR is a mixed integer linear program. In each case it can be solved by using existing software.

The authors provide error bounds as follows. Let the demands and the features be bounded:

0 ≤ Di ≤ Dmax,  |Xj| ≤ Xmax, 1 ≤ j ≤ m.

For a given sample (Xi, Di), i = 1, 2, · · · , n, let β̂ be the solution produced by the optimization problem CV, and β̂(R) be the result of solving CVR. Define

ψe(β) = (1/n) Σ_{i=1}^{n} [h(β0 + βXi − Di)^+ + b(Di − β0 − βXi)^+]

to be the empirical sample value of the cost. We want to know how much it deviates from the population mean

ψ(β) = E(h(β0 + βX −D)+ + b(D − β0 − βX)+),

where the expectation is computed assuming that X and D have a given distribution with the bounded support. The first bound is given by

Theorem 1.2. The following bound holds with probability at least 1 − δ:

|ψ(β̂) − ψe(β̂)| ≤ (2 max(b, h)^2 Dmax / min(b, h)) (m/n) + ((4 max(b, h)^2 Dmax / min(b, h)) m + Dmax) √(ln(2/δ)/n).

Thus, as the sample size increases, the error converges to zero, provided m/n goes to zero. This is the case if the number of features is small compared to the sample size. This is called the low-dimensional case. However, if m/n is large, we have a high-dimensional case, and the above bound is not very useful. In this case the authors give another bound, coming out of the regularized optimization problem CVR.

Theorem 1.3. The following bound holds with probability at least 1 − δ:

|ψ(β̂(R)) − ψe(β̂(R))| ≤ (max(b, h)^2 / min(b, h)) (Xmax^2/(nλ)) + ((2 max(b, h)^2 / min(b, h)) (Xmax^2/λ) + Dmax) √(ln(2/δ)/n).

This bound does not use m, the number of features. The authors consider many extensions of theseresults.

Assignment: Present the proofs of these theorems.

Page 31: Data Driven Decision Models - Vidyadhar Kulkarnivkulkarn.web.unc.edu/files/2018/10/alltextDDDM.pdf · standard deviation increase until the maximum mean 1.53 is reached at x= :69

1.6. DATA DRIVEN APPROACH 31

1.6.5 Deep Learning

The material in this section is based on Oroojlooyjadid et al [85]. They use deep learning methods to simultaneously solve the problem of forecasting demand and stocking for it. They study a multi-item newsvendor problem. We shall describe their method in a single-item setting to simplify the notation. We begin by describing the simplest basic setup of a neural network and how it attempts to solve a problem. See the book by Kriesel [60] or a very readable article by Sarle [95].

In its simplest setting, a neural network can be thought of as a directed acyclic weighted graph (V, E, w) with vertex set V, a set E of directed edges, and weights w(u, v) for each edge (u, v) in E. For each node v ∈ V, let I(v) ⊂ V be the set of input nodes and O(v) ⊂ V be the set of output nodes. That is,

I(v) = {u ∈ V : (u, v) ∈ E},  O(v) = {u ∈ V : (v, u) ∈ E}.

If I(v) = ∅, v is called an input (or source) node, and if O(v) = ∅, v is called an output (or sink) node.

The neural network is said to be a layered network with k layers if it has the following structure: the set of input nodes forms the first layer V1, and the set of output nodes forms the last layer Vk. The other nodes are partitioned into subsets V2, · · · , Vk−1 so that the edges in the graph always go from a node in Vi to a node in Vi+1, 1 ≤ i ≤ k − 1. The layers 2 through k − 1 are called the hidden layers. If k is more than 4, the network is called a deep neural network. (There is no universal definition of this.)

An output y(u) from node u is fed as input to node v if (u, v) ∈ E. Thus a node v gets input from all the nodes in I(v). A propagation function f combines all the inputs into a single input number x(v), typically as a weighted sum:

x(v) = f(y, w) = Σ_{u∈I(v)} y(u) w(u, v).

Sometimes a node-dependent constant w(v), called the bias of node v, is added to the right hand side. Next, an activation function (also called a transfer function) g transforms this aggregated input x(v) into an output y(v) for node v. A typical transfer function is the sigmoid function with parameter ρ, as given below:

y(v) = 1/(1 + exp(−x(v)/ρ)).

Thus y(v) ∈ (0, 1). The transfer functions at the input and output nodes are assumed to be identity maps; that is, the output of such a node is the same as its input. The input to an input node u represents an input x(u) from outside, and it equals the output y(u) from that node. Let x = [x(u) : u ∈ I] be the input vector to the network, where I is the set of input nodes. The output from an output node v represents an output y(v), and it equals the input x(v) to that node. Let y = [y(v) : v ∈ O] be the output vector from the network, where O is the set of output nodes. We can think of the neural network as a non-linear function that maps an input vector x into an output vector y.

The main idea in neural networks is that one can “train” them. In our context we take training to mean changing the weight matrix W until a “good” set of weights is obtained. We need to know how to score an output to carry out this training. We do this by using a scoring function h that maps an output vector into

Page 32: Data Driven Decision Models - Vidyadhar Kulkarnivkulkarn.web.unc.edu/files/2018/10/alltextDDDM.pdf · standard deviation increase until the maximum mean 1.53 is reached at x= :69

32 CHAPTER 1. DATA-DRIVEN INVENTORY MANAGEMENT

a real number; the smaller the score, the better the output. We consider steepest descent learning, where the weight w(u, v) is changed by an amount ∆(u, v) given by

∆(u, v) = −η ∂h/∂w(u, v).

Here η > 0 is the learning rate. This computation is particularly easy to do for layered networks, creatingwhat is known as the back-propagation algorithm. It uses the following recursive formulas in the case of alayered network with k layers: First define

δ(v) = y(v)(1 − y(v)) ∂h/∂y(v), v ∈ Vk,

δ(v) = (Σ_{u∈Vj+1} w(v, u) δ(u)) y(v)(1 − y(v)), v ∈ Vj, 1 ≤ j < k. (1.30)

Then we get

∆(u, v) = −η y(u) δ(v). (1.31)

Then we compute the new set of weights by

wnew(u, v) = wold(u, v) + ∆(u, v), (u, v) ∈ E.

We iterate this until a “reasonable” set of weights is obtained, declare the network trained, and then it is ready to process any input from the world. There are many other algorithms for finding a reasonable set of weights.


Figure 1.3: A Neural Network with one hidden layer.

Example 1.1. We show a small neural network in Figure 1.3. It has 8 nodes and three layers. The set V1 = {1, 2, 3} forms the input layer, the set V2 = {4, 5} forms the hidden layer, and V3 = {6, 7, 8} forms the output layer. The input vector is x = [x(1), x(2), x(3)]. We have

[y(1), y(2), y(3)] = [x(1), x(2), x(3)].


Using the weight matrix W we compute the input to nodes 4 and 5 as

x(4) = y(1)w(1, 4) + y(2)w(2, 4), x(5) = y(2)w(2, 5) + y(3)w(3, 5).

The output from nodes 4 and 5 is then computed as

y(4) = 1/(1 + exp(−x(4)/ρ)),  y(5) = 1/(1 + exp(−x(5)/ρ)).

The inputs to nodes 6, 7, 8 are then computed as follows:

x(6) = y(4)w(4, 6), x(7) = y(4)w(4, 7) + y(5)w(5, 7), x(8) = y(5)w(5, 8).

The output from the network is

[y(6), y(7), y(8)] = [x(6), x(7), x(8)].

Suppose the scoring function is given by

h = y(6)^2 + y(7)^2 + y(8)^2.

From this we get

∂h/∂y(v) = 2y(v), v = 6, 7, 8.

Then using Equation 1.30 we get

δ(v) = 2y(v)^2 (1 − y(v)), v = 6, 7, 8,

δ(5) = (w(5, 7)δ(7) + w(5, 8)δ(8))y(5)(1− y(5)),

δ(4) = (w(4, 6)δ(6) + w(4, 7)δ(7))y(4)(1− y(4)),

δ(3) = w(3, 5)δ(5)y(3)(1− y(3)),

δ(2) = (w(2, 4)δ(4) + w(2, 5)δ(5))y(2)(1− y(2)),

δ(1) = w(1, 4)δ(4)y(1)(1− y(1)).

The step size ∆(u, v) can now be computed using Equation 1.31.
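The example can be exercised numerically. A sketch for the Figure 1.3 network; since the output transfer is the identity map, the sketch uses δ(v) = ∂h/∂y(v) = 2y(v) at the output nodes rather than the sigmoid form of Equation 1.30, and the initial weights are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(w, x_in, eta=0.1):
    """One forward/backward pass for the Figure 1.3 network with score
    h = y(6)^2 + y(7)^2 + y(8)^2.  w holds the eight edge weights."""
    y1, y2, y3 = x_in                      # identity input layer
    y4 = sigmoid(y1 * w['14'] + y2 * w['24'])
    y5 = sigmoid(y2 * w['25'] + y3 * w['35'])
    y6 = y4 * w['46']                      # identity output layer
    y7 = y4 * w['47'] + y5 * w['57']
    y8 = y5 * w['58']
    score = y6 ** 2 + y7 ** 2 + y8 ** 2
    d6, d7, d8 = 2 * y6, 2 * y7, 2 * y8    # deltas at identity output nodes
    d4 = (w['46'] * d6 + w['47'] * d7) * y4 * (1 - y4)
    d5 = (w['57'] * d7 + w['58'] * d8) * y5 * (1 - y5)
    grad = {'14': y1 * d4, '24': y2 * d4, '25': y2 * d5, '35': y3 * d5,
            '46': y4 * d6, '47': y4 * d7, '57': y5 * d7, '58': y5 * d8}
    return {k: w[k] - eta * grad[k] for k in w}, score
```

Each call returns the updated weights (one step of Equation 1.31) and the score before the step, so repeated calls should drive the score down.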

Oroojlooyjadid et al [85] use a neural net with 3 layers. Suppose we have N records from the past. The ith record gives the feature values xji, j = 1, 2, · · · , p, and the observed demand di (i = 1, 2, · · · , N). The authors use one input node for each of the p features, indexed 1 through p, one hidden node (indexed p + 1), and one output node (indexed p + 2). They use the first m records for training the net. For the ith record (1 ≤ i ≤ m) we have

x(j) = xji = y(j), 1 ≤ j ≤ p,

x(p + 1) = Σ_{j=1}^{p} y(j) w(j, p + 1),

y(p + 1) = 1/(1 + exp(−x(p + 1))),

x(p + 2) = y(p + 1) w(p + 1, p + 2) = y(p + 2) = ŷi (say).

The authors consider two different score functions:

h1 = (1/m) Σ_{i=1}^{m} [h(ŷi − di)^+ + b(di − ŷi)^+],

h2 = (1/m) Σ_{i=1}^{m} (h(ŷi − di)^+ + b(di − ŷi)^+)^2.

To use the back-propagation algorithm we need the partial derivatives given below:

∂h1/∂w(p + 1, p + 2) = (1/m)[h |{i : ŷi > di}| − b |{i : ŷi < di}|],

∂h2/∂w(p + 1, p + 2) = (2/m)(h Σ_{i: ŷi > di} (ŷi − di) − b Σ_{i: ŷi < di} (di − ŷi)).

Then we can use Equations 1.30 and 1.31 in the back-propagation algorithm to modify the weights, and repeat until we find a “good” set of weights. The authors then use these weights to predict the order size for records i = m + 1, · · · , N. They compare their results with several other competing algorithms and show that the neural net algorithm performs better than the rest.
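As a minimal sketch (not the authors' implementation), the two score functions above can be computed from predicted order quantities yi and observed demands di as follows; the function name and the tiny data set are illustrative assumptions:

```python
import numpy as np

def newsvendor_scores(y, d, h, b):
    """Compute the two scores h1 and h2 from the text.

    y: predicted order quantities, d: observed demands,
    h: per-unit overage (holding) cost, b: per-unit underage (backorder) cost.
    """
    y, d = np.asarray(y, float), np.asarray(d, float)
    per_record = h * np.maximum(y - d, 0.0) + b * np.maximum(d - y, 0.0)
    h1 = per_record.mean()           # average newsvendor cost
    h2 = (per_record ** 2).mean()    # average squared newsvendor cost
    return h1, h2
```

For example, with y = [2, 1], d = [1, 3], h = 1, b = 2, the per-record costs are 1 and 4, giving h1 = 2.5 and h2 = 8.5.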


Chapter 2

Data-Driven Offer Optimization

2.1 The Basic Model

We begin with a basic offer optimization problem. Suppose we are selling a house and we get a sequence of offers. Let Xn be the size of the nth offer. After receiving the nth offer, we can either accept it or reject it. If we accept it, we get Xn dollars and the problem terminates. If we reject it, we wait for the next offer. There is no cost to waiting. Once we reject an offer it is no longer available. We are interested in a policy that maximizes the expected value of the accepted offer, assuming we know that there are N offers. Suppose Xn, n ≥ 1 are iid non-negative random variables with common cumulative distribution F and mean τ.

If we are buying, and Xn is the nth price offered, then we would like to minimize the accepted price. An interesting case of this appears in a Dutch auction, where the auctioneer starts with a high price and regularly reduces it until someone in the audience snaps up the item. In this case, for a single individual the choice is to accept the item at the current asking price, or risk losing the item altogether.

This problem can easily be solved using an MDP formulation. Let π be a policy that tells us at time n whether to accept the current offer or wait for one more, based on the offers (X1, X2, · · · , Xn) so far, and assuming that we have not accepted any of them. The policy π can also be thought of as a stopping time for the sequence X1, X2, · · · , XN. There is a large literature devoted to optimal stopping, arising out of sequential statistical tests.

In the stationary MDP formulation, it is more convenient to index time backwards, so that n means there are n more offers yet to come. Let vπn be the expected value of the accepted offer if we follow policy π, there are n more offers to go, and we have not accepted any offer so far. Define

vn = supπ vπn,

where the supremum is taken over all admissible policies π. (A policy is called admissible if it uses only the information currently available to make the decision.) We index the offers backwards, so that Xn is the offer we receive when there are n offers left to go. Clearly, X1 is the last offer, and, if we have not accepted any offers so far, we have to accept it. Hence we have

v1 = E(X1) = τ.

For n ≥ 2, the principle of optimality says that we should accept the offer Xn if it is greater than vn−1, and otherwise reject it. Hence

vn = E(max(Xn, vn−1)). (2.1)

This is called the optimality equation, or Bellman equation. For continuous offers with density f, this yields

vn = ∫_{vn−1}^∞ x f(x) dx + vn−1 ∫_0^{vn−1} f(x) dx. (2.2)

Thus the expected offer under the optimal policy is vN .

Example 2.1. Suppose the offers are uniformly distributed over [0, 1]. This case has been worked out in Section 5a of Gilbert and Mosteller [42]. We have v1 = 1/2 and, for n ≥ 2, Equation 2.2 yields

vn = vn−1 ∫_0^{vn−1} dx + ∫_{vn−1}^1 x dx = (1 + v²n−1)/2.

Computing recursively:

v1 = 1/2, v2 = 5/8, v3 = 89/128, · · · .

Moser [84] proved that

lim_{n→∞} n(1 − vn) = 2.

More precisely, Gilbert and Mosteller [42] report that

vn ∼ 1 − 2/(n + log(n + 1) + 1.767).
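A quick numerical sketch of the recursion in Example 2.1 (the function name is an illustrative choice):

```python
def uniform_offer_values(N):
    """v[n] = expected accepted offer with n U(0,1) offers to go,
    computed from the recursion v_n = (1 + v_{n-1}^2) / 2."""
    v = [0.0, 0.5]                        # v[1] = 1/2; v[0] unused
    for n in range(2, N + 1):
        v.append((1.0 + v[-1] ** 2) / 2.0)
    return v
```

Running it reproduces v2 = 5/8 and v3 = 89/128, and for large N the quantity N(1 − v[N]) approaches 2, in line with Moser's result.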

Example 2.2. Suppose the offers are exponentially distributed with mean 1. This case has also been worked out in Section 5a of Gilbert and Mosteller [42]. We have v1 = 1 and, for n ≥ 2, Equation 2.2 yields

vn = vn−1 ∫_0^{vn−1} e^{−x} dx + ∫_{vn−1}^∞ x e^{−x} dx = vn−1 + e^{−vn−1}.

Computing recursively:

v1 = 1, v2 = 1.3679, v3 = 1.6225, v4 = 1.8199, · · · .

In general, n(1 − F(vn)) converges to a constant that is determined by the limiting distribution of max(X1, X2, · · · , Xn). See Kennedy and Hertz [58].
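The recursion of Example 2.2 can be checked the same way (a sketch; the name is illustrative):

```python
import math

def exponential_offer_values(N):
    """v[n] for iid exp(1) offers, from v_n = v_{n-1} + exp(-v_{n-1})."""
    v = [0.0, 1.0]                        # v[1] = 1; v[0] unused
    for n in range(2, N + 1):
        v.append(v[-1] + math.exp(-v[-1]))
    return v
```

Running it reproduces v2 ≈ 1.3679 and v4 ≈ 1.8199.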

There are several alternate formulations of this problem.

Minimax Regret: Let Y1, Y2, · · · , YN be the ascending order statistics of X1, X2, · · · , XN. Thus the size of the best expected offer we can hope to get is E(YN). Hence the expected regret of following policy π is E(YN) − vπN. The relative regret of following policy π is

RπF = (E(YN) − vπN)/E(YN).


Here we specifically emphasize that the regret depends on the underlying distribution F and the policy π. Suppose we know that the cdf F belongs to a class of distributions F. Define

Rπ = sup_{F∈F} RπF

to be the worst regret under π over all F in F. We want to find a policy that minimizes this worst case regret. Let π∗ be such that

Rπ∗ = infπ Rπ,

where the infimum is taken over all admissible policies π. If such a π∗ exists, it is called the minimax regret policy.

Note that E(YN) does not depend on the policy π; hence if π∗ is a minimax regret policy, it also satisfies

vπ∗N = supπ inf_{F∈F} vπN.

Hence the minimax regret policies are also called the maximin reward policies. We shall study two special cases in Section 2.3.

Waiting Cost: We describe the non-zero cost model in more detail below, based on Sakaguchi [93]. Suppose waiting for the next offer costs c > 0 dollars and we can wait for as many offers as we want. Let v(x) be the optimal reward if the current offer is x. If we accept it, our reward is x. If we decide to continue, it costs us c dollars to wait for the next offer, and we face the same problem again. Hence, from the principle of optimality, we get

v(x) = max{ x, ∫ v(y) dF(y) − c }.

By using value iteration, Sakaguchi shows that the solution is given by

v(x) = max{x, α},

where α is given by a solution to

∫_α^∞ (y − α) dF(y) = c. (2.3)

If F is discrete, we should choose the largest α for which the LHS is at least c, or the smallest α for which the LHS is at most c. If α < 0, there is no point in even playing this game. Assuming α ≥ 0, it is optimal to take at least one observation, accept the current observation x if x > α, and continue otherwise.

Example 2.3. For the U(0, 1) distribution, Equation 2.3 reduces to

(α − 1)² = 2c.

Assuming c ≤ 1/2, we get

α = 1 − √(2c).

If c > 1/2, it is not worth playing this game at all, since the cost of an observation is more than the expected value of the reward!
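Equation 2.3 can be solved numerically for any distribution with a computable E((X − α)+), since that quantity is continuous and decreasing in α. A bisection sketch (function names are illustrative):

```python
def solve_alpha(expected_excess, c, lo=0.0, hi=1.0, tol=1e-12):
    """Find alpha solving E((X - alpha)^+) = c by bisection.

    expected_excess(a) must return E((X - a)^+), which is
    continuous and decreasing in a on [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_excess(mid) > c:
            lo = mid          # excess too large: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

# For U(0,1), E((X - a)^+) = (1 - a)^2 / 2, so alpha = 1 - sqrt(2c).
alpha = solve_alpha(lambda a: (1.0 - a) ** 2 / 2.0, c=0.125)
```

With c = 0.125 this returns α = 1 − √0.25 = 0.5; for an unknown F, expected_excess could be replaced by a Monte Carlo or empirical estimate.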

Assignment: Present a proof of this result.


2.1.1 Non-stationary Setting

It is fairly straightforward to extend most of the above analysis to the case where X1, X2, · · · , XN are independent but not identically distributed. Let Fn be the cdf of Xn. Now it is more convenient to index the offers forward. Thus vn(x) is the best expected offer if we have seen offers X1, X2, · · · , Xn−1 and have not accepted any one of them, and have now observed Xn = x. The optimality equation becomes

vn(x) = max{ x, ∫ vn+1(y) dFn+1(y) }, 1 ≤ n ≤ N − 1,

with

vN(x) = x.

In many cases we get a non-stationary problem even if the offers are iid. We explain with a few examples. Assume that the offers are continuous random variables, so that the probability of ties is zero. Define the relative and absolute ranks of the nth offer as

Rn = ∑_{j=1}^n 1{Xn ≤ Xj}, An = ∑_{j=1}^N 1{Xn ≤ Xj}.

Note that the largest offers have the smallest rank. Since the offers are iid, we see that R1, R2, · · · , Rn are independent random variables with probability distribution

P(Rn = j) = 1/n, 1 ≤ j ≤ n.

More importantly, this is independent of F. Let Rπ and Aπ be the relative and the absolute rank of the accepted offer under policy π. There are several possibilities here:

1. Find a policy that maximizes the probability of selecting the best offer. This is called the “Best Choice” problem. In this case it does not matter what the actual size of the offer is; we only care whether it is the best one or not. Suppose we have observed the first n offers and have not accepted any. If Rn is not 1, then the current offer cannot be the best, and hence we reject it. We have P(An = 1|Rn = 1) = n/N. Hence if Rn = 1 and we accept it, we get an expected reward of n/N. If we reject it, the probability that the next candidate is the best so far is 1/(n + 1). Now let vn be the best expected reward if we have seen n offers, have not accepted any so far, and Rn = 1. Let wn be the same, but with Rn ≠ 1. Then we get the following optimality equation:

vn = max{ n/N, vn+1/(n + 1) + n wn+1/(n + 1) }, 1 ≤ n ≤ N − 1,

and

wn = vn+1/(n + 1) + n wn+1/(n + 1), 1 ≤ n ≤ N − 1.

We initialize this recursion by setting vN = 1 and wN = 0. One can show that there exists an n∗ such that it is optimal to reject the first n∗ offers, and then select the first one that is the best so far. One can show that the maximum expected reward and n∗/N both approach 1/e ≈ .3679 as N becomes large. See Gilbert and Mosteller [42] and Freeman [36] for further variations.

This formulation also appears as a solution to the so-called secretary problem, where we have N candidates that we can interview one at a time. After each interview, we have to either hire the candidate, or lose her forever and interview the next candidate. We get a unit reward if we hire the best candidate, and zero reward otherwise. This problem has a long history, starting in 1875 in a collection of problems proposed by Cayley [26]. See Ferguson [35] for a very readable history of this problem and its many variations. The problem is also treated in rigorous detail by Chow and Robbins [30].

2. Find a policy that minimizes the expected absolute rank of the accepted offer, that is, minimizes E(Aπ). In this case, define vn(r) as the best expected rank of the accepted offer if we have seen n offers so far, accepted none of them, and the relative rank of the current offer is r. Then one can derive the following optimality equation:

vn(r) = min{ (N + 1)r/(n + 1), (1/(n + 1)) ∑_{i=1}^{n+1} vn+1(i) }, 1 ≤ n ≤ N − 1, 1 ≤ r ≤ n,

with

vN(s) = s.

The best expected rank is given by v1(1). One can show that there exists an increasing sequence r∗(n), 1 ≤ n ≤ N, such that we accept the nth offer if n is the first offer whose relative rank Rn satisfies

Rn ≤ r∗(n).

Clearly, r∗(N) = N, so we always accept the last offer if we haven't accepted any of the earlier offers. It is not easy to compute r∗(n). However, it is known that

lim_{N→∞} v1(1) = ∏_{j=1}^∞ ((j + 2)/j)^{1/(j+1)} ≈ 3.87.

Thus if we follow this policy and there are many potential offers, we expect to accept, on average, roughly the fourth best offer. See Freeman [36] and Lindley [71].

3. Find a policy that maximizes the probability that the absolute rank of the chosen offer is at most k, a given integer, that is, maximizes P(Aπ ≤ k). When we set k = 1, we get the best choice problem. Let πk be the policy that maximizes this for a fixed value of k. Gusein-Zade [47] gives the structure of πk as follows: there exists a non-decreasing function r∗ : {1, 2, · · · , N} → {0, 2, · · · , k} such that after observing the nth offer it is optimal to accept it if Rn ≤ r∗(n). Computing this function, however, is not easy. From the best choice problem we know that

lim_{N→∞} P(Aπ1 ≤ 1) = 1/e ≈ .368.


Gusein-Zade showed that

lim_{N→∞} supπ P(Aπ ≤ 2) = .574.

Thus the probability of getting one of the top two offers is about .574 if the number of offers is large. Also see Quine and Law [90].
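The best-choice recursion in item 1 is easy to compute directly; this sketch (names are illustrative) recovers the 1/e limits numerically:

```python
def best_choice(N):
    """Backward recursion for the best-choice problem.

    v[n]: best success probability after n offers when the nth is best so far;
    w[n]: same when it is not. Returns (v[1], n_star), where n_star is the
    smallest n at which accepting a best-so-far offer is optimal.
    """
    v = [0.0] * (N + 1)
    w = [0.0] * (N + 1)
    v[N], w[N] = 1.0, 0.0
    n_star = N
    for n in range(N - 1, 0, -1):
        cont = v[n + 1] / (n + 1) + n * w[n + 1] / (n + 1)
        w[n] = cont
        v[n] = max(n / N, cont)
        if n / N >= cont:
            n_star = n            # accepting is optimal from here on
    return v[1], n_star
```

For N = 1000 this returns a success probability near 1/e ≈ .368, with n_star/N also near 1/e.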

What happens if F is unknown? We consider several alternatives below.

2.2 Unknown Parametric F : Bayesian Approach.

2.2.1 Normal Distribution with Unknown Mean

We consider a special model studied by Sakaguchi [93]. Suppose Xn, n ≥ 1 are iid Normal random variables with unknown mean θ and known variance σ². We assume that θ has an initial prior density that is normal with mean θ0 and variance σ0². We know that the posterior distribution of θ after n observations x1, x2, · · · , xn is normal with mean

θn = (θ0/σ0² + n x̄n/σ²) / (1/σ0² + n/σ²),

and variance

σn² = 1/(1/σ0² + n/σ²).

Here x̄n is the average of x1, · · · , xn. The unconditional distribution of Xn+1 given the observed mean x̄n = u is Normal with mean θn and variance σn² + σ², both of which are functions of n and u. Denote its density by fn+1(·|u).

Now suppose we have observed values x1, x2, · · · , xn, with xn = x, and the current sample average of these n values is x̄n = u. As before, we can stop now and get a reward x. Or, we can pay the cost c and wait for the next offer Xn+1, which is a random variable with density fn+1(·|u). Let vn(x, u) be the maximum reward from now on. We have

vn(x, u) = max{ x, ∫ vn+1(y, (nu + y)/(n + 1)) fn+1(y|u) dy − c }.

We can solve the above equation if we assume that we can take at most N observations, and if we accept none of those offers, we get a reward of 0. That is, we set

vN(x, u) = max{x, 0},

and use the backward recursion to compute vN−1(x, u), · · · , v1(x, u) in that order. It is clear that there exist functions αn(u) such that

vn(x, u) = max{x, αn+1(u)}.


Sakaguchi [93] gives explicit expressions for the αn(u)'s in the special case when σ0² = ∞ (this represents the non-informative prior) and σ² = 1.

The same methodology can be used for any parametric distribution in the exponential family using its conjugate prior.

2.2.2 Uniform Distribution with Unknown Upper Bound

The analysis presented here is inspired by the results in Stewart [101]. Suppose Xn, n ≥ 1 are iid U[0, β], where the upper bound β is unknown. We assume that β has the Pareto prior distribution with scale parameter u0 and shape parameter k0:

P(β > x) = (u0/x)^{k0} if x ≥ u0, and P(β > x) = 1 if 0 ≤ x ≤ u0.

This is the conjugate prior for β. Suppose we have observed offers x1, x2, · · · , xn. It is known that the posterior distribution of β is Pareto with scale parameter

un = max{u0, x1, x2, · · · , xn},

and shape parameter

kn = k0 + n.

Given kn = k and un = u, the pdf of Xn+1 is given by

fn+1(x|k, u) = (k/(k + 1))(1/u) if 0 ≤ x ≤ u, and fn+1(x|k, u) = (k/(k + 1)) u^k/x^{k+1} if x > u.

One can also show that

E(Xn+1|k, u) = (k/(k − 1))(u/2).

Suppose Xn = x, and k = kn and u = un are computed as above. Let vn(x, k, u) be the maximum reward from now on. We have

vn(x, k, u) = max{ x, ∫ vn+1(y, k + 1, max(u, y)) fn+1(y|k, u) dy − c },

where fn+1 is as given above. We can solve the above equation if we assume that we can take at most N observations, and if we accept none of those offers, we get a reward of zero. (This is slightly different from having to accept exactly one offer.) That is, we set

vN(x, k, u) = max{x, 0} = x,

and use the backward recursion to compute vN−1(x, k, u), · · · , v1(x, k, u) in that order. This yields

vn(x, k, u) = max{x, αn+1(k, u)},

where

αn+1(k, u) = ∫ vn+1(y, k + 1, max(y, u)) fn+1(y|k, u) dy − c.

Using this we get

αN(k, u) = E(XN |kN = k, uN = u) − c = (k/(k − 1))(u/2) − c.

Thus, at stage N − 1, given that kN−1 = k and uN−1 = u, we accept the offer XN−1 = x if

x > (k/(k − 1))(u/2) − c.

It is recommended that we use the non-informative prior u0 = 0 and k0 = 0, although it is a defective distribution. In that case we have

kn = n, un = max{x1, x2, · · · , xn}.

Stewart shows that there is an r such that it is optimal to ignore the first r offers and then pick the first one that is greater than the corresponding α value.
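A minimal sketch of the posterior update and the stage-(N − 1) acceptance threshold above, with the non-informative prior as the default (function names and the sample offers are illustrative):

```python
def pareto_posterior(offers, u0=0.0, k0=0):
    """Posterior (shape k_n, scale u_n) for beta after observing the offers."""
    k = k0 + len(offers)
    u = max([u0] + list(offers))
    return k, u

def accept_threshold(k, u, c):
    """Accept the current offer x at stage N-1 iff x exceeds this value."""
    return (k / (k - 1)) * (u / 2) - c

k, u = pareto_posterior([0.3, 0.9, 0.5])    # k = 3, u = 0.9
threshold = accept_threshold(k, u, c=0.1)   # (3/2)(0.45) - 0.1 = 0.575
```

So after these three offers, the next-to-last offer would be accepted only if it exceeds 0.575.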

Assignment: Verify the calculations given here. Derive an equivalent to Theorem 1 of Stewart [101].

2.3 Unknown Parametric F: Maximin Criterion

In this section we shall study the maximin reward policies, or equivalently, the minimax regret policies for two specific instances of F.

Suppose F is a parametric family that is invariant under linear transformations. For example, X(a, b) ∼ U(a, b) is invariant under linear transformations since we can write X(a, b) ∼ a + (b − a)X(0, 1). Similarly, X(µ, σ²) ∼ N(µ, σ²) is invariant under linear transformations since we can write X(µ, σ²) ∼ µ + σX(0, 1).

If the distributions in F have this linear invariance property, it is possible to develop maximin policies for this F. Here we consider the particular case where F is a set of linearly invariant distributions F(a, b) with location parameter a ∈ (−∞, ∞) and scale parameter b ∈ [0, ∞). Let Xπ be the value of the accepted offer if policy π is followed. Define

vπN(a, b) = E_{F(a,b)}((Xπ − a)/b),

where the expectation is carried out under the assumption that the offer distribution is F(a, b). We say that a policy π∗ is maximin if it maximizes the minimum expected reward, that is, if

vπ∗N(0, 1) = supπ min_{a,b} vπN(a, b).

That is, for any given π we first compute its worst case performance over all possible distributions in F, and then choose the π = π∗ (assuming it exists) that maximizes this worst case performance.

We present two cases below.


2.3.1 Uniform Offers

Samuels [94] derives the maximin optimal policy for the case of U(α, β) offers with unknown upper and lower limits α and β. We shall present his results without proof.

Let N be the maximum number of offers. Assume that N ≥ 3. Define

cN = 1,

and for i = N, N − 1, · · · , 3,

ci−1 = ci(1 − ci/2) − ci(1 − ci)/(i + 1) − (1 − ci)/((i + 1)(i − 2)).

Define

Lj = min{X1, X2, · · · , Xj},

and

Uj = max{X1, X2, · · · , Xj}.

Let π be the policy that continues to wait for the next offer as long as j < N and

(Xj − Lj)/(Uj − Lj) > cj.

(If j = N, we accept the final offer, by necessity.) Samuels shows that this is a maximin policy for N ≥ 6, and gives other specific rules for N = 2, 3, 4, and 5. This is actually a Bayes optimal procedure if we assume that the prior density of (α, β) is given by

f(α, β|l, u) = 2(u − l)/(β − α)³,

where l < u are given parameters, and −∞ < α < l < u < β < ∞. See Stewart [101].
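A sketch of Samuels' threshold sequence; note that the grouping of the last two terms of the ci recursion as fractions with denominators (i + 1) and (i + 1)(i − 2) is an assumption recovered from the garbled display, and the function name is illustrative:

```python
def samuels_thresholds(N):
    """Thresholds c_2, ..., c_N with c_N = 1 and, for i = N, ..., 3,
    c_{i-1} = c_i(1 - c_i/2) - c_i(1 - c_i)/(i+1) - (1 - c_i)/((i+1)(i-2))."""
    c = {N: 1.0}
    for i in range(N, 2, -1):
        ci = c[i]
        c[i - 1] = (ci * (1 - ci / 2)
                    - ci * (1 - ci) / (i + 1)
                    - (1 - ci) / ((i + 1) * (i - 2)))
    return c
```

The policy then continues past offer j < N whenever (Xj − Lj)/(Uj − Lj) > c[j]; the thresholds decrease in j, so early offers are held to a stricter standard.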

Assignment: Present a proof of this result.

2.3.2 Normal Offers

Petrucelli [87] derives a maximin policy for Normally distributed rewards with unknown mean µ and unknown variance σ². Let ψ(n, ·) be the t-density with n degrees of freedom. We restate his results below without proof. As before we assume there are a maximum of N ≥ 2 offers X1, X2, · · · , XN. For 1 ≤ n ≤ N define

X̄n = ∑_{i=1}^n Xi/n,

S²n = (∑_{i=1}^n (Xi − X̄n)²)/n,

d(k) = √(2(k − 1)) Γ(k/2) / (k Γ((k − 1)/2)).


Set c(N − 1) = 0, and for n = N − 2, N − 3, · · · , 1, recursively define

a(n) = √(n − 1) c(n + 1) / √((n + 1)d²(n + 1) − c²(n + 1)),

b(n) = [((n + 1)d²(n + 1) − c²(n + 1)) / ((n + 1)d(n + 1))]^{(n−1)/2},

c(n) = c(n + 1)ψ(n − 1, a(n)) + √(n/(2π(n + 1))) b(n) if b(n) > 0, and c(n) = c(n + 1) otherwise.

Consider the policy which stops with X2 if X2 > X1 and c(2) < 1/√π, and otherwise stops at the first n ≥ 3 for which

(Xn − X̄n−1)/Sn−1 > c(n)/√(d²(n) − c²(n)/n),

or stops at N. Petrucelli shows that this policy is maximin among all rules which take at least two observations.

Assignment: Present the proof of this result.

2.4 Unknown Non-parametric F

Now we consider the case of non-parametric unknown cdf F and discuss several possible alternatives.

2.4.1 Empirical Distribution

We consider the problem with waiting cost c per offer. Suppose we have observed X1, · · · , Xn so far and have not stopped. Let Fn be the empirical distribution defined as

Fn(x) = (1/n) ∑_{i=1}^n 1{Xi ≤ x}.

Then compute αn as a solution to Equation 2.3 using Fn in place of F. This implies that we should choose αn to be the largest α that satisfies

∑_{i: Xi>α} (Xi − α) ≥ nc.

We can show that such an α can be computed as follows. Let Y1 ≤ · · · ≤ Yn denote the Xi's in increasing order. If

S0 = ∑_{i=1}^n Yi ≤ nc,

then αn = 0. Otherwise let k ∈ {1, 2, · · · , n} be the smallest integer such that

Sk = ∑_{i=k}^n (Yi − Yk) < nc,


and set

αn = Yk−1 + ((Sk−1 − nc)/(Sk−1 − Sk))(Yk − Yk−1).

From the consistency of Fn, it follows that αn converges to the solution α of Equation 2.3 as n becomes large. The heuristic rule is then to accept the current offer Xn if it is greater than αn, and otherwise take one more observation. I have not seen any paper discussing this approach.

Assignment: Verify this and code it.

One can use simulation to compare the expected reward from the heuristic procedure to the exact optimal procedure.
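A sketch of the sorted-order construction described in the text (names are illustrative; Y0 = 0 is used as a sentinel so that the k = 1 case matches the S0 case, assuming non-negative offers):

```python
def empirical_alpha(xs, c):
    """Largest alpha with sum_i (X_i - alpha)^+ >= n*c, computed from the
    sorted sample as in the text (assumes the X_i are non-negative)."""
    n = len(xs)
    y = [0.0] + sorted(xs)                        # y[0] = 0 sentinel
    S = [sum(y[i] - y[k] for i in range(k, n + 1)) for k in range(n + 1)]
    if S[0] <= n * c:
        return 0.0
    k = next(k for k in range(1, n + 1) if S[k] < n * c)
    return y[k - 1] + (S[k - 1] - n * c) / (S[k - 1] - S[k]) * (y[k] - y[k - 1])
```

For example, with the sample [1, 2, 3] and c = 0.5 we get αn = 1.75, and indeed ∑(Xi − 1.75)+ = 0.25 + 1.25 = 1.5 = nc.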

2.4.2 Scarf’s Inequality

We consider the problem with waiting cost c per offer. Suppose the cdf F is unknown, but we know the mean τ and variance σ² of the offers. Then we use Scarf's inequality of Equation 1.19 in the LHS of Equation 2.3, to get

∫_α^∞ (y − α) dF(y) = E((X − α)^+) ≤ (1/2)(√(σ² + (α − τ)²) − (α − τ)).

It is known that this bound is tight. Hence we are looking for the smallest α that satisfies

(1/2)(√(σ² + (α − τ)²) − (α − τ)) ≤ c.

This yields

α ≥ τ − c + σ²/(4c).

Since we are looking for the smallest α, we see that the maximin policy is to accept an offer x if

x ≥ τ − c + σ²/(4c),

and wait for another offer otherwise. This will maximize the worst case expected net return (accepted offer minus the waiting cost) among all offer distributions with mean τ and variance σ². I have not seen any paper discussing this approach either.

2.4.3 Relative Rank

The material in this section is based on Goldenshluger and Zeevi [46]. They give an implementable algorithm to compute the optimal policy for the problem considered by Gusein-Zade [47], namely, find a policy πk that maximizes P(Aπ ≤ k). They first establish a simple bound on this maximum as given below.

Lemma 2.1. Let α ∈ (0, 1) be such that

(1 − α) ln(1/(1 − α)) < 1/N,

and k ∈ {1, 2, · · · , N}. Then

P(Aπk > k) ≤ (1 − α)^{k/2} + (1 − α)^{(k−1)α/4} + [(1 − α)/(1 − exp(−3α²/32))] exp(−3α²k/32).

In particular, for α = 2/3 and N > 7, we have

P(Aπk > k) ≤ 11 exp(−k/4).

Thus the probability of choosing an offer with absolute rank higher than k decreases exponentially in k. Note that this is independent of the offer size distribution. How does this help us with the optimal offer size? Goldenshluger and Zeevi establish the connection between the two via the following lemma:

Lemma 2.2.

E(YN − vπN) ≤ E(YN − YN−k) + E(YN−k − Y1) P(Aπ > k).

It is clear that the relative regret will depend on the behavior of YN, the maximum of X1, X2, · · · , XN. This problem has been studied extensively in extreme value theory. Here is a summary of the relevant results: there exist a sequence an, n ≥ 1 of positive reals, a sequence bn, n ≥ 1 of reals, and a cdf Φ such that

lim_{n→∞} F^n((x − bn)/an) = Φ(x), −∞ < x < ∞.

The limiting distribution Φ can take three different forms:

1. Fréchet Class: This has parameter α > 0 and support [0, ∞) with

Φ(1)(x) = exp(−x^{−α}), x ≥ 0.

2. Reverse Weibull Class: This has parameter α > 0 and support (−∞, 0] with

Φ(2)(x) = exp(−(−x)^α), x ≤ 0.

3. Gumbel Class: This has support (−∞, ∞) with

Φ(3)(x) = exp(−e^{−x}).

One can show that the exponential distribution belongs to the Gumbel class, and the uniform(0, 1) distribution belongs to the reverse Weibull class with α = 1. Goldenshluger and Zeevi establish the following result:

Theorem 2.1. Let kN be a sequence of integers such that kN → ∞ and kN/N → 0 as N → ∞. Let πkN be the policy that maximizes the probability of choosing an offer whose absolute rank is at most kN when there are a maximum of N offers. If the offer distribution belongs to the Gumbel class or the reverse Weibull class, then the relative regret of following the policy πkN converges to zero. If it belongs to the Fréchet class, it converges to a strictly positive constant.

They also describe the second order asymptotic properties of the relative regret, and illustrate with the examples of uniform and exponential offer sizes.


Chapter 3

Data-Driven Staffing of Service Systems

3.1 The Basic Model

Queueing models have been used as models of service systems. We describe several basic models below; see Kulkarni [61]. In each case we study the limiting distribution of X(t), the number of customers in the system at time t, and several performance measures in steady state. All models described here assume that the customers arrive according to a PP(λ) (Poisson process with rate λ) and demand iid exp(µ) service times.

3.1.1 M/M/1 queue

In this model there is a single server and infinite waiting room. The system is stable if ρ = λ/µ < 1. If the system is stable, we have

pj = lim_{t→∞} P(X(t) = j) = ρ^j(1 − ρ), j ≥ 0.

In a stable system, the expected number of customers in steady state is given by

L = ρ/(1 − ρ),

the expected time in the system is given by

W = (1/µ) · 1/(1 − ρ),

the expected time in the queue is given by

Wq = (1/µ) · ρ/(1 − ρ),

and the probability of positive wait and the expected number of busy servers are both given by

B = ρ.
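A sketch of these steady-state formulas (the function name is an illustrative choice):

```python
def mm1_metrics(lam, mu):
    """Steady-state metrics for a stable M/M/1 queue (requires lam < mu)."""
    rho = lam / mu
    assert rho < 1, "queue is unstable"
    L = rho / (1 - rho)            # mean number in system
    W = 1 / (mu * (1 - rho))       # mean time in system
    Wq = rho / (mu * (1 - rho))    # mean time in queue
    return L, W, Wq
```

For example, λ = 1, µ = 2 gives ρ = 0.5, L = 1, W = 1, and Wq = 0.5, consistent with Little's law L = λW.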


3.1.2 M/M/s queue

In this model there are s identical servers and infinite waiting room. The system is stable if ρ = λ/sµ < 1, or r = λ/µ < s. Define

ρj = r^j/j! for 0 ≤ j < s, and ρj = (s^s/s!) ρ^j for j ≥ s,

and

p0 = [∑_{j=0}^{s−1} ρj + (s^s/s!) ρ^s/(1 − ρ)]^{−1}.

If the system is stable, we have

pj = lim_{t→∞} P(X(t) = j) = ρj p0, j ≥ 0.

The probability of positive wait, which is the same as the probability of all servers busy, in steady state, is given by the Erlang-C formula

C(s, r) = ps/(1 − ρ) = [(r^s/s!) s/(s − r)] / [∑_{j=0}^{s−1} r^j/j! + (r^s/s!) s/(s − r)].

In a stable system, the expected number of busy servers is given by

B = r,

and the probability that an individual server is busy is given by ρ. The expected queueing time is given by

Wq = (1/µ) C(s, r)/(s − r).
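A sketch of the Erlang-C formula and the resulting Wq, assuming r < s (function names are illustrative):

```python
from math import factorial

def erlang_c(s, r):
    """Probability of positive wait in a stable M/M/s queue with offered load r."""
    assert r < s, "need r < s for stability"
    tail = (r ** s / factorial(s)) * s / (s - r)
    head = sum(r ** j / factorial(j) for j in range(s))
    return tail / (head + tail)

def mms_wq(s, lam, mu):
    """Expected queueing time W_q = (1/mu) C(s, r) / (s - r)."""
    r = lam / mu
    return erlang_c(s, r) / (mu * (s - r))
```

As a sanity check, erlang_c(1, r) reduces to r, the M/M/1 probability of positive wait.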

3.1.3 M/M/s/s queue

In this model there are s identical servers and no waiting room. Thus, the customers who find all servers busy are lost. Let r = λ/µ. The system is always stable, and we have

pj = lim_{t→∞} P(X(t) = j) = (r^j/j!) / ∑_{k=0}^s (r^k/k!), 0 ≤ j ≤ s,

which is a truncated Poisson distribution. An interesting feature of this model is that the steady state formula is valid for any service time distribution with mean 1/µ, that is, for an M/G/s/s queue. The probability of blocking, which is the same as the probability of all servers busy, in steady state, is given by the Erlang-B formula

B(s, r) = ps = (r^s/s!) / ∑_{j=0}^s (r^j/j!).
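In practice the Erlang-B formula is usually evaluated with the standard recursion B(0, r) = 1, B(j, r) = rB(j − 1, r)/(j + rB(j − 1, r)), which avoids the large factorials of the direct formula; a sketch:

```python
def erlang_b(s, r):
    """Blocking probability in M/M/s/s via the standard Erlang-B recursion."""
    b = 1.0                      # B(0, r) = 1
    for j in range(1, s + 1):
        b = r * b / (j + r * b)
    return b
```

For s = 1 this gives r/(1 + r), matching the direct formula.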


3.1.4 M(t)/M/∞ queue

The M/M/∞ queue is the limiting case of the M/M/s/s queue as s → ∞. This queue is always stable and in steady state the number of customers is a P(r) random variable. This result holds for general service time distributions as well. In this case it is possible to study the non-stationary version where the arrival process is a NPP(λ(·)). Such a queue is denoted by M(t)/G/∞. In this case we can show that X(t) is a P(m(t)) random variable, where

m(t) = ∫_0^t λ(u)(1 − G(t − u)) du, (3.1)

and G is the service time distribution.

3.1.5 M/M/s with abandonment

Abandonment is a real phenomenon in service systems such as call centers, health care systems, etc. A simple model of abandonment assumes that the customers have iid exp(θ) patience times, and if a customer's service does not start before his patience time expires, the customer abandons the system without service. Such a system is always stable as long as θ > 0. Let

ρj = λ^j / ∏_{k=1}^j (min(k, s)µ + (k − s)^+ θ), j ≥ 0,

p0 = 1 / ∑_{j=0}^∞ ρj.

Then the limiting distribution is given by

pj = lim_{t→∞} P(X(t) = j) = ρj p0, j ≥ 0,

and the probability of positive wait is given by

A(s, r) = ∑_{j=s}^∞ pj.

This is called the Erlang-A formula. One can compute the expected queue length as

Lq = ∑_{j=s}^∞ (j − s) pj,

and the expected queueing time as

Wq = Lq/λ.
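A numerical sketch of these quantities, truncating the infinite sums at a large level (the truncation level and the names are illustrative choices):

```python
def erlang_a(s, lam, mu, theta, jmax=2000):
    """Approximate (A, Lq) for M/M/s+M by truncating the sums at jmax."""
    rho = [1.0]                                  # rho_0 = 1
    for k in range(1, jmax + 1):
        rate = min(k, s) * mu + max(k - s, 0) * theta
        rho.append(rho[-1] * lam / rate)
    p0 = 1.0 / sum(rho)
    p = [x * p0 for x in rho]
    A = sum(p[s:])                               # probability of positive wait
    Lq = sum((j - s) * p[j] for j in range(s, jmax + 1))
    return A, Lq
```

As a check, when θ = µ the death rates reduce to kµ, so the limiting distribution is Poisson(r) and A(s, r) = P(Poisson(r) ≥ s).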

3.2 Staffing in single-class many-server service systems

Here we consider a service system with a single class of customers. For simplicity we model it as an M/M/s system. We consider two main results here.


3.2.1 Halfin-Whitt Staffing Rule

A typical staffing problem in an M/M/s service system is to find the number of servers s such that the probability of positive wait (or the expected queueing time) equals a pre-specified small value α (say .05), while the servers are almost always busy. This is called the Quality- and Efficiency-Driven design, or QED for short. This is not possible to do in small systems (that is, when the traffic intensity is small), but it is possible for large congested systems. We briefly explain the main result here. See Kulkarni [61].

Let h(·) be the hazard rate of a standard normal random variable. It is known that h(−β)/(β + h(−β)) is a decreasing function of β that starts at 1 for β = 0 and decreases to 0 as β → ∞. Let β = β(α) > 0 be the unique solution to

h(−β)/(β + h(−β)) = α.

Choose

s = r + β(α)√r, (3.2)

where r = λ/µ. One can show that under this staffing rule (called the square-root staffing formula) the probability of positive wait satisfies

lim_{r→∞} C(s, r) = lim_{r→∞} C(r + β(α)√r, r) = α.

Also, the probability that any one server is busy is given by

r/s ≈ 1 − β/√r

when r is large. Thus the servers are busy almost all the time. The asymptotic regime described by Equation 3.2 as r → ∞ is called the Halfin-Whitt regime; see Halfin and Whitt [49].

One can use this asymptotic to devise a staffing rule that minimizes the cost rate as follows. Suppose each server costs c dollars per unit time and each customer who faces a positive wait costs a dollars. Thus, if we use the Halfin-Whitt staffing rule with parameter α, the cost rate is

C(α) = cs + aλα = c(r + β(α)√r) + arµα.

We choose an α to minimize this, by setting

C′(α) = c√r β′(α) + arµ = 0.

One can show that β′(α) increases from −∞ at α = 0 to −√(2/π) at α = 1. (Verify this.) Thus, for large r, there is a unique α ∈ (0, 1) for which

β′(α) = −aµ√r/c.

We get the minimum cost staffing by using this α and the corresponding β(α) in Equation 3.2.


3.2.2 Garnett-Mandelbaum-Reiman Staffing Rule

Garnett et al [41] consider an M/M/s + M system, that is, an M/M/s system where each waiting customer abandons the system without starting service at an exponential rate θ. They extend the Halfin-Whitt formula to account for the abandonments. We present their main result below. Note that we must now incorporate δ, the fraction of the customers who abandon in steady state, as part of the performance measures.

Let β = β(α, θ) be the solution to

√(µ/θ) h(−β) / [h(β√(µ/θ)) + √(µ/θ) h(−β)] = α.

Unlike in the case of no abandonment, β(α, θ) may be negative in the presence of abandonments. (This reduces to the β(α) of the previous section if we let θ → 0 and use the asymptotic expression 1 − Φ(x) ∼ φ(x)/x as x → ∞.) Choose

s = r + β(α, θ)√r. (3.3)

Then, as r → ∞, we have

lim_{r→∞} P(W > 0) = α,

δ ∼ α [√(µ/θ) h(β√(µ/θ)) − β] / √r,

E(W) ∼ δ/θ,

P(W > t) ∼ α [1 − Φ(√(µ/θ)β + √θ t)] / [1 − Φ(√(µ/θ)β)].

Thus the probability of wait is at a pre-specified level α, and the abandonment fraction is small if r is large, etc.
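The defining equation for β(α, θ) can be solved the same way as before; the sketch below (assuming SciPy, with illustrative parameters) allows for the negative roots that arise with abandonment.

```python
from scipy.optimize import brentq
from scipy.stats import norm

def hazard(x):
    # Hazard rate of the standard normal distribution.
    return norm.pdf(x) / norm.sf(x)

def beta_abandonment(alpha, mu, theta):
    """Solve sqrt(mu/theta)h(-b) / (h(b*sqrt(mu/theta)) + sqrt(mu/theta)h(-b)) = alpha."""
    c = (mu / theta) ** 0.5
    f = lambda b: c * hazard(-b) / (hazard(b * c) + c * hazard(-b)) - alpha
    return brentq(f, -10.0, 10.0)   # root may be negative with abandonment

# With mu = theta the equation is symmetric, so alpha = 1/2 gives beta = 0;
# a lenient target alpha > 1/2 yields a negative beta (staffing below r).
b0 = beta_abandonment(0.5, mu=1.0, theta=1.0)
b_neg = beta_abandonment(0.8, mu=1.0, theta=1.0)
```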

Mandelbaum and Zeltyn [80] construct a minimum cost staffing rule using these asymptotics.

Assignment: Present the minimum cost staffing rule from Mandelbaum and Zeltyn [80].

3.3 Multi-class Multi-pool Service Systems

Harrison and Zheevi [52] consider a service system with m classes of customers and r server pools. The customers of type i arrive at rate λi. Let λ = [λi] be the vector of these arrival rates. A customer of type i who is not in service reneges at rate γi > 0. There are bk identical servers in the kth pool. If a customer of type i can be served by a server of pool k, the pair (i, k) is called a process. If we denote the allowed combination (i, k) as process j, its service time is exp(µj). There are a total of n processes. Define matrices R = [Ri,j] and A = [Ak,j] as follows: Ri,j = µj if process j serves customers of class i, and 0 otherwise; Ak,j = 1 if process j uses server pool k, and 0 otherwise. Let xj be the number of servers dedicated to process j. Then [Rx]i is the total processing capacity allocated to class i customers. Since there is positive abandonment, it makes sense to consider overloaded systems, that is, we assume Rx ≤ λ. Suppose each abandonment from class i costs pi, and let p = [pi]. They show that the abandonment cost per unit time is p(λ − Rx). We must also have Ax ≤ b. Thus, for a fixed λ and b, we need to find an allocation vector x that solves the following Linear Program (LP):

Minimize π(λ, b) = p(λ−Rx) (3.4)

Subject to: Rx ≤ λ, Ax ≤ b, x ≥ 0. (3.5)

Next they formulate the optimal staffing problem by assuming that λ is a random variable with a given distribution G. Suppose a server of pool k costs ck per unit time. Then they find the staffing level b to

Minimize φ(b) = cb + E(π(λ, b)). (3.6)

This is a two-stage stochastic programming problem.
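To make the LP concrete, here is a minimal sketch of solving (3.4)–(3.5) with scipy.optimize.linprog on a made-up two-class, one-pool instance (all numbers are hypothetical).

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: 2 classes, 1 pool, 2 processes (process i serves class i).
p = np.array([2.0, 1.0])         # abandonment cost per customer of each class
lam = np.array([3.0, 6.0])       # arrival rates
R = np.array([[1.0, 0.0],        # R[i,j] = mu_j if process j serves class i
              [0.0, 2.0]])
A = np.array([[1.0, 1.0]])       # A[k,j] = 1 if process j uses pool k
b = np.array([5.0])              # servers in the pool

# Minimize p(lam - Rx) = p.lam - (pR)x  <=>  maximize (pR)x,
# subject to Rx <= lam, Ax <= b, x >= 0.
res = linprog(c=-(p @ R),
              A_ub=np.vstack([R, A]),
              b_ub=np.concatenate([lam, b]),
              bounds=[(0, None)] * R.shape[1],
              method="highs")
abandonment_cost = p @ lam + res.fun   # pi(lam, b); res.fun = -(pR)x*
```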

3.4 Time Varying Staffing

The material in this subsection is based on the paper by Green et al [43]. They model a service system as an M(t)/G/s(t) + GI queue, where arrivals occur according to an NHPP(λ(·)), service times are iid with mean τ and cdf G(·), the number of servers at time t is s(t), and the patient impatience times are iid. (A patient leaves the system without service if his queueing time is longer than his impatience time.) The aim is to devise an optimal staffing policy, that is, the function s(·), so that the total staff hours are minimized, subject to a quality level constraint, such as: 80 percent of all service requests must begin service without delay, or the abandonment rate must be less than 5 percent, etc.

PSA: Point-wise Stationary Approximation: At each time t compute the performance measure (such as the delay probability) assuming a stationary system with arrival rate λ = λ(t), and compute the needed staffing level s(t). One can use the Halfin-Whitt square-root formula if appropriate. This is a reasonable approximation when the services are short and quality requirements are high.

Segmented PSA: Many times, changing staffing levels continuously with time is not feasible, and we are forced to keep them constant over intervals of length T, called the staffing intervals. Over the nth staffing interval [nT, (n + 1)T), we could compute

sn = max{s(t) : nT ≤ t < (n + 1)T}

and then set the staffing level over the nth interval as sn. This is called the segmented PSA approach.

SIPP: Stationary Independent Period by Period: For the interval [nT, (n + 1)T) consider a stationary queueing system with arrival rate

λn = (1/T) ∫₀^T λ(nT + u) du

Page 53: Data Driven Decision Models - Vidyadhar Kulkarnivkulkarn.web.unc.edu/files/2018/10/alltextDDDM.pdf · standard deviation increase until the maximum mean 1.53 is reached at x= :69

3.5. BAYESIAN STAFFING 53

and assume that the system is in steady state over the entire interval. Using this, compute the staffing level sn. Then use the staffing level function s(t) = sn for nT ≤ t < (n + 1)T. (See Green et al [44].) This works well for short service times, short staffing intervals, and slowly varying arrival rates.

SPHA: Simple Peak Hour Approximation: This method is used for short service times and long staffing intervals (say 10 minute service times and an 8 hour staffing interval). Here we compute the peak arrival rate

λn = max{λ(t) : nT ≤ t < (n + 1)T}

and compute an sn assuming a stationary system with arrival rate λn.

OLM: Offered Load Models: When service times are short, we turn to the infinite server models with NHPP arrivals as studied before and compute m(t) from Equation 3.1. We then replace λ(t) by m(t)/E(S) and use PSA or SIPP. This induces a lag in the arrival rate. An even simpler method to induce a time lag is to replace λ(t) by

λ(t − E(S²)/(2E(S))).

Green et al [44] suggest several other approximations that induce such a lag.

SDPP: Stationary Dependent Period by Period: Yu et al [106] suggest another approximation that seems to be appropriate when the service times are comparable to or longer than the staffing intervals. Let Xn be the number of patients in the system at the end of the nth staffing interval. Then they estimate the arrival rate for the next staffing interval [nT, (n + 1)T) by

λn = ( Xn + ∫_{nT}^{(n+1)T} λ(u) du ) / (T + E(S)).

Then the staffing level for the nth interval is computed assuming the arrival process is Poisson with this rate and is in steady state over the whole interval. The term Xn in the numerator makes the behaviour over the nth interval dependent on what happened in the previous interval. Hence the name SDPP.

There are many papers related to time varying systems. See Mandelbaum et al [79] and Massey and Whitt [81].

3.5 Bayesian Staffing

Now we consider how we can do staffing when the parameters of the system are unknown. The natural course to follow is to use the past data to estimate the relevant parameters, and then use the models described above to do the staffing. Here we consider the Bayesian approach. That is, we assume that our service center is an M/M/s + M system described in Section 3.2.2, with arrival rate λ, service rate µ and abandonment rate θ. We then assume that (λ, µ, θ) has a given prior distribution, and we use the data to compute the posterior distribution. Such an approach is studied by Aktekin and Soyer [3]. We present some of their results below.


Since all distributions of interest are exponential, we assume that the priors are independent gamma distributions. In particular, we assume that

λ ∼ Gam(a0, b0), µ ∼ Gam(c0, d0), θ ∼ Gam(e0, f0).

Suppose our data D consists of na interarrival times x1, x2, · · · , xna; ns service times y1, y2, · · · , yns; nr actual reneging (same as abandonment) times z1, z2, · · · , znr; and nc censored reneging times (where service started before abandonment occurred) w1, w2, · · · , wnc. Then we see that the posterior distributions of λ, µ and θ are independent gamma distributions given by

λ|D ∼ Gam(a1, b1), µ|D ∼ Gam(c1, d1), θ|D ∼ Gam(e1, f1),

where

a1 = a0 + na, b1 = b0 + Σ_{i=1}^{na} xi,

c1 = c0 + ns, d1 = d0 + Σ_{i=1}^{ns} yi,

e1 = e0 + nr, f1 = f0 + Σ_{i=1}^{nr} zi + Σ_{i=1}^{nc} wi.

There are two ways to use this information. We can first compute the posterior mean of the parameters as

E(λ|D) = a1/b1, E(µ|D) = c1/d1, E(θ|D) = e1/f1.

We can use these parameters in the M/M/s + M model to exactly compute various performance measures such as P(W > 0) and the abandonment rate δ as a function of s, or use the asymptotics developed in Section 3.2.2.

An alternative is to use the posterior distributions to generate n samples (λi, µi, θi), 1 ≤ i ≤ n, and then estimate these performance measures as sample averages as follows:

P(W > 0) ≈ (1/n) Σ_{i=1}^{n} P(W > 0 | λi, µi, θi),

δ = P(Ab) ≈ (1/n) Σ_{i=1}^{n} P(Ab | λi, µi, θi).

The advantage of this method is that it yields a confidence interval for these performance measures. The authors apply this methodology to call center data from a bank.
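A sketch of both uses of the posterior, with made-up data and priors; for simplicity it evaluates P(W > 0) with the exact M/M/s (Erlang C) formula rather than the full M/M/s + M computation.

```python
import numpy as np

def erlang_c(lam, mu, s):
    """P(W > 0) in an M/M/s queue, via the Erlang B recursion."""
    r = lam / mu
    if r >= s:
        return 1.0                      # unstable: everyone waits
    B = 1.0
    for k in range(1, s + 1):           # Erlang B: B(k) = rB(k-1)/(k + rB(k-1))
        B = r * B / (k + r * B)
    rho = r / s
    return B / (1.0 - rho + rho * B)    # Erlang C from Erlang B

rng = np.random.default_rng(0)
x = rng.exponential(1 / 2.0, 50)        # fake interarrival times
y = rng.exponential(1 / 1.0, 50)        # fake service times
a1, b1 = 1.0 + len(x), 1.0 + x.sum()    # posterior Gam(a1, b1) for lambda
c1, d1 = 1.0 + len(y), 1.0 + y.sum()    # posterior Gam(c1, d1) for mu

s = 5
# (i) plug in the posterior means; (ii) average over posterior samples.
plug_in = erlang_c(a1 / b1, c1 / d1, s)
lam_s = rng.gamma(a1, 1.0 / b1, 1000)   # numpy gamma takes shape, scale = 1/rate
mu_s = rng.gamma(c1, 1.0 / d1, 1000)
mc = np.mean([erlang_c(l, m, s) for l, m in zip(lam_s, mu_s)])
```

The sample of per-draw delay probabilities behind `mc` is also what yields the credible/confidence interval mentioned above.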

Assignment: Present the dependent prior model from Aktekin and Soyer [3].


3.6 Data Driven Staffing

Data driven staffing involves using the past arrival and service time data to forecast future arrivals and service times, and then using one of the models described in the previous sections to obtain the staffing levels, assuming the forecasted arrival and service parameters are the true parameters. It is also possible to forecast the distributions of the parameters, in which case one can try to find the optimal staffing either by solving a stochastic program or by robust optimization. We illustrate these methods below.

3.6.1 Stochastic Programming

Here we include a brief overview of stochastic programming in the context of staffing. Consider the single class single pool model of Section 3.2.2, with arrival rate λ and abandonment rate θ. Suppose we can route the arrivals to two different service centers. In center i (i = 1, 2) there are si servers, each serving at rate µi and costing ci per unit time to employ. Suppose we route an arrival to center 1 with probability p and to center 2 with probability 1 − p. Let λ1 = λp and λ2 = λ(1 − p). Let Ai(λi, si) be the fraction of customers who abandon, and Bi(λi, si) be the fraction of customers who face positive wait in steady state in center i. Suppose each abandonment costs a and each customer with positive wait costs b. Then the total expected cost per unit time is given by

φ(Λ, p, s) = Σ_{i=1}^{2} [ci si + λi(a Ai(λi, si) + b Bi(λi, si))],

where Λ = (λ1, λ2) and s = (s1, s2). We do not know the arrival rate precisely, but we know its distribution F. We want to staff the system so as to minimize the expected total cost per unit time.

In a one-stage stochastic program we simply solve the optimization problem

Minimize E(φ(Λ, p, s))

and obtain the optimal decisions p∗, s∗1 and s∗2.

In a two-stage stochastic program we make the decisions in two stages. We assume that the staffing levels s1 and s2 have to be determined before the actual value of the arrival rate λ is observed, since these are not easy to change. However, we can set p once the actual value of λ is observed. Define, for given λ, s1 and s2,

ψ(p) = Σ_{i=1}^{2} λi [a Ai(λi, si) + b Bi(λi, si)].

We first solve the second stage problem

Minimize ψ(p), such that 0 ≤ p ≤ 1,

to obtain the optimal value of p, say p∗(λ, s). Then we solve the first stage problem

Minimize c1s1 + c2s2 + E(ψ(p∗(λ, s))).


A common strategy for solving the stochastic programs is to replace the expectations by sums over scenarios. That is, suppose we know from our forecasting model that λ takes k values (called scenarios) with

pi = P(λ = xi), 1 ≤ i ≤ k.

Then the first stage problem is written as

Minimize c1s1 + c2s2 + Σ_{i=1}^{k} pi ψ(p∗(xi, s)).

This is called the quadrature method if the xi are thought of as the discretization of the random variable λ and the pi are computed from F. Alternatively, we can think of xi as the ith sample of λ drawn from the distribution F, and use pi = 1/k. This is called the sample average approximation (SAA) method. The first method typically works better, but involves more computation to get the pi, which may be hard to do in multidimensional problems.

Gans et al [40] illustrate this methodology with the model from Section 3.2.2 using the SIPP approach of Section 3.4. They use an SVD (Singular Value Decomposition) method of forecasting to get the distribution of the forecast on day n, using the data from days 1, · · · , n − 1. The forecast is a vector of 24 hourly rates on day n. A duty schedule describes when a server is on duty during the day, for example: 8 to 12 in the morning and then from 2 to 6 in the afternoon. They consider 216 possible duty schedules for the day, and decide how many servers to put on each of these schedules to create a staffing schedule for the day. This is done as the stage two problem, that is, assuming the total number of servers for the day is fixed and the arrival rate profile is as given by the forecast. Then they compute the total number of servers to be used for that day by solving the first stage problem using the quadrature method.

Assignment: Present the forecasting results of Gans et al [40].

3.6.2 Robust Optimization

There are two alternatives to the stochastic programming formulation: certainty equivalence and robust optimization. Under the certainty equivalence method, we replace the random variable λ by its mean and ignore the variability. Thus the stochastic problem becomes a simpler deterministic optimization problem and is solved as such. Under robust optimization, we make the decisions assuming the worst case scenario. This creates conservative minimax strategies, and may lead to serious overstaffing.

3.6.3 Bassamboo-Zheevi Model

The material in this section is based on Bassamboo and Zheevi [18]. They use the two-stage stochastic programming methodology, starting with the model described in Section 3.3 studied by Harrison and Zheevi [52]. Suppose we have the following data: Fi(t), the number of arrivals of class i over [0, t). This is just a counting process. From this data we compute an estimate of the arrival rate process for class i at time s as follows:

λi(s) = (Fi(s + w) − Fi(s))/w

where w is the window size. Let λ(s) be the vector of these estimated arrival rates. In order to compute an estimate of the cdf G of the arrival rates, assume that the past data is divided into N intervals of length T. For a given vector λ, compute the empirical joint distribution of the arrival rates over the next segment by

G(λ) = (1/(NT)) Σ_{n=1}^{N} ∫_{(n−1)T}^{nT} I{λ(s) ≤ λ} ds.

We use this to compute the staffing level b for the next segment by solving

Minimize φ(b) = cb + ∫ π(λ, b) dG(λ). (3.7)

They propose the following “gradient” algorithm to compute the “optimal” staffing level b.

1. Set b = initial guess for the staffing level.

2. Solve the LP of Equations 3.4 and 3.5, and compute the shadow prices ψ(λ, b) corresponding to the constraint Ax ≤ b. Let ψ̄(b) be the expectation of ψ(λ, b), assuming that G is the cdf of λ. Then set

δb = c + ψ̄(b).

3. If ||δb|| is "small", stop. Else, set b = b − αδb, where α is the step size, and go back to Step 2.

They recommend that the step size α should be a decreasing function of the iteration number, for example α = 1/i for the ith iteration. The intuition behind the above algorithm is that c + ψ̄(b) is the gradient of the objective function in the optimization problem of Equation 3.6.
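A minimal sketch of the gradient computation on a hypothetical one-class, one-pool instance; the shadow prices of the constraint Ax ≤ b are read off the LP duals, which SciPy exposes as `ineqlin.marginals`.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical one-class, one-pool, one-process instance.
p, c = np.array([10.0]), np.array([1.0])     # abandonment cost, server cost
R, A = np.array([[1.0]]), np.array([[1.0]])  # the single process serves the class

def gradient(b, lam):
    """delta_b = c + psi(b): server cost plus the shadow price of Ax <= b."""
    res = linprog(c=-(p @ R),
                  A_ub=np.vstack([R, A]),
                  b_ub=np.concatenate([lam, b]),
                  bounds=[(0, None)], method="highs")
    psi = res.ineqlin.marginals[len(lam):]   # duals of the Ax <= b rows (<= 0)
    return c + psi

lam = np.array([5.0])
g_low = gradient(np.array([2.0]), lam)    # understaffed: negative, so raise b
g_high = gradient(np.array([10.0]), lam)  # overstaffed: positive, so lower b
```

A negative component of the gradient says an extra server saves more in abandonment cost than it costs, so the descent step increases that pool.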

The authors consider two main sources of error: the LP formulation is based on a fluid model, and hence is an approximation of the actual cost of staffing. Secondly, the estimates of the arrival rates are based on finite sample data, and hence are an approximation of the actual parameters. They show that as the systems become "large" (that is, fluid limits hold) and the sample size increases, both these errors go to zero. Hence the algorithm is expected to produce a near optimal staffing level under these assumptions. It is crucial to remember that they assume that the data over different segments are iid.

Assignment: Present the numerical results of Section 6 of Bassamboo and Zheevi [18].

Assignment: Present the theoretical results of Section 7 of Bassamboo and Zheevi [18].

Bertsimas and Doan [12] consider the Bassamboo and Zheevi model of a call center as described above. They assume that time is slotted into short intervals and the parameters of the model remain constant over each time slot. They use real data from a US bank to estimate the arrival, service, and abandonment rates for each time slot. They study an optimization problem similar to the one in Equation 3.6, except that they use the αth quantile in place of the expected value. This is along the same lines as the development in Section 1.6.1. They show how this optimization problem can be reduced to a linear program whose parameters can be estimated from the data.

Assignment: Present the results of Bertsimas and Doan [12].

See Arkin et al [5] for a review of the literature related to call centers until 2007.

3.7 Data Driven Staffing: A Case Study

This material is based on a working paper by Yu et al [106]. A virtual computing lab (VCL) is operated by UNC's IT department. It consists of about 800 PCs that can be used remotely. A user can ask for a PC to be loaded with a desired software combination, say MATLAB, Word, and Excel. Once such a PC is made available, the user uses it for a random amount of time and then releases it.

In the simplest mode of operation, when a user requests a PC with a given combination of software, that combination is loaded on a PC (if available) and is made available to the user. If all PCs are in use, the user request is rejected. When the user is done, she releases the PC; it is then scrubbed of the user data and becomes available for any other user. The loading and unloading operation takes a couple of minutes, and hence each user is delayed. Thus a customer is either delayed or rejected. We call this Policy 1.

In a modified mode of operation, the VCL has identified the 400 most commonly requested software combinations. Each combination is loaded on two PCs. When a user requests any of these combinations and one is available, the user gets it immediately, with almost zero delay. If not, the VCL loads one of the available PCs with the requested software and makes it available after a delay. Thus a customer is either satisfied, delayed, or rejected (if no PCs are available). We call this Policy 2.

The VCL has given us access to about three years worth of data. Each record of this data set includes the time the request was made (arrival time), the software combination requested (customer class), and how long the customer occupied the PC (service time). Our mission is to construct a better operating policy.

We consider the following operating policy: a PC can be OFF or ON. An ON PC can be dedicated to a particular class (that is, pre-loaded with that class of software) or flexible (that is, no software is loaded). We need to decide how many PCs should be dedicated to each of the classes, and how many should be left flexible. When a customer of a given class arrives, we give him a PC from the dedicated pool of that class if one is available. If not, the customer is delayed, and we load a PC from the flexible pool for this purpose. If no PC is available from the flexible pool, we call the customer rejected. (In practice we turn on a new PC and give it to her; however, this takes over ten minutes.) We evaluate two performance measures: the percentage of customers who are delayed and the percentage who are rejected. We want to devise a policy that has at most a fraction α of customers delayed and at most a fraction β rejected, using the least number of ON PCs. This minimizes the power consumption, which is a large component of the operating cost.


3.7.1 Arrival Data

Figure 3.1 (a) shows the aggregated hourly arrivals from August 1, 2008 to July 31, 2011. Panels (b) and (c) show the average hourly arrivals in each hour of the week (starting with Sunday midnight) and each hour of the day (starting with midnight), respectively.


Figure 3.1: Aggregated Arrivals

Similar behavior is exhibited by the arrival patterns of the individual classes. We see clear semesterly, weekly, and daily cycles. We use the data up to the nth hour to predict the arrival rates in the (n + 1)st hour. We have tried two prediction approaches: Moving Average and SVD forecasting. We then use two staffing approaches: SIPP uses the predicted arrival rate in the (n + 1)st hour as the true arrival rate; SDPP modifies it based on the current state, as described in Section 3.4.

Service time data is also available, but it does not show any discernible time dependence, and the service time distributions seem reasonably close to the exponential distribution (depending on the class). Hence we do not do any forecasting of service times.

3.7.2 Decision Model

We model the number of users of class i in the system during the nth hour as an M/G/∞ system in steady state with arrival rate λ^i_n and mean service time τ^i. Thus X^i_n, the number of users of class i during the nth hour, is a Poisson random variable with parameter ρ^i_n = λ^i_n τ^i, with mean and variance given by ρ^i_n. We then solve the following optimization problem to compute s^i_n, which is then used to compute d^i_n, the number of dedicated servers to be used for class i during the nth hour:

Minimize Σ_i s^i_n

Such that Σ_i (λ^i_n / Σ_j λ^j_n) P(X^i_n > s^i_n) ≤ α.


Using the Poisson distribution to compute the probabilities in the above formulation did not work too well due to over-dispersion in the arrival data. Hence we used the following Chebyshev inequality:

P(X^i_n > s^i_n) ≤ min(1, ρ^i_n / (s^i_n − ρ^i_n)²).

The optimization problem is not easy to solve for two reasons: the constraint is complicated due to the presence of the min operator, and the decision variables are supposed to be integers. We develop a method of solving the above optimization problem which is guaranteed to produce a local minimum (which we think is a global minimum) if integrality is ignored. Then we set the number of dedicated servers as

d^i_n = B^i_n + s^i_n,

where B^i_n is the number of class i users in the system at the beginning of the nth hour.

We then estimate λ^f_n, the arrival rate to the flexible pool, which is just the sum of the overflow rates from all the dedicated user pools. We then compute s^f_n by modeling the number in the flexible pool as an M/G/∞ queue, with a mean service time computed as the weighted average of the mean service times of all the classes (weighted by the overflow rates). Let τ^f_n be this weighted mean (note that it depends on n, since the weights change with n). Then compute ρ^f_n = λ^f_n τ^f_n, and using Chebyshev's inequality obtain

s^f_n = ρ^f_n + √(ρ^f_n/β).

If B^f_n is the number of users using the flexible pool servers, we set the number of flexible pool servers for the nth hour as

d^f_n = B^f_n + s^f_n.

This is called Policy 3.
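The Chebyshev bound yields a closed-form staffing level: requiring ρ/(s − ρ)² ≤ ε forces s ≥ ρ + √(ρ/ε), which is the formula used for both pools above. A small sketch with illustrative numbers:

```python
import math

def chebyshev_staffing(rho, eps):
    """Smallest integer s with rho / (s - rho)^2 <= eps, i.e. s >= rho + sqrt(rho/eps)."""
    return math.ceil(rho + math.sqrt(rho / eps))

# e.g. a hypothetical offered load rho = 20 and tail target eps = 0.005 (0.5%)
s = chebyshev_staffing(20.0, 0.005)
assert 20.0 / (s - 20.0) ** 2 <= 0.005   # the bound indeed holds at this s
```

Being distribution-free, this level is conservative relative to the Poisson tail, which is exactly what the over-dispersed arrival data calls for.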

3.7.3 Performance

We simulated the system with the trace data using Policy 1, Policy 2, and the proposed Policy 3. Under Policy 1, every user is delayed, but hardly anyone is rejected, and it keeps all 800 PCs on all the time. Under Policy 2, about 15% of the users are delayed, hardly anyone is rejected, and all 800 PCs are on all the time. Under Policy 3, we have tried four combinations: SVD+SIPP, SVD+SDPP, MA+SIPP, MA+SDPP. The number of ON PCs and the delay and rejection rates are plotted in Figures 3.2 and 3.3. The target delay rate is 5% and the target rejection rate is .5%. It is clear that SVD far outperforms MA as a method of forecasting, and SDPP slightly outperforms SIPP. Hence we propose Policy 3, with SVD forecasting and the SDPP approximation, as a reasonable policy. Under this policy, less than 5% of the customers face delay and less than .5% of the customers are rejected, thus meeting the targets. It keeps at most 475 PCs on, thus providing over 40% savings in energy consumption. This policy can easily be automated, always using the past thirty days of data.


(Panels: "SIPP, SVD, Mean" and "SIPP, Moving Average, Mean" — probability of delay and blocking, and average number of ON, dedicated, and flexible servers, by hour of the day.)

Figure 3.2: Performance Measures for Policy 3 using SIPP approximation


(Panels: "SDPP, SVD, Mean" and "SDPP, Moving Average, Mean" — probability of delay and blocking, and average number of ON, dedicated, and flexible servers, by hour of the day.)

Figure 3.3: Performance Measures for Policy 3 using SDPP approximation


Chapter 4

Data-Driven Revenue Management and Dynamic Pricing

Revenue management involves selling a product or service to the right customer at the right price at the right time. It was pioneered by the airline industry ([100]), and then adopted by the hospitality industry ([83]). It is being adopted by more and more industries as its efficacy becomes apparent. Talluri and van Ryzin [103] is an excellent text on this topic. Den Boer [22] is a very nice survey paper on dynamic pricing in a variety of settings.

4.1 Yield Management: Fixed Pricing

We begin by discussing the yield management problem, which initially arose out of airline booking problems.

4.1.1 Basic Model

The material is based on several papers: see Belobaba [11] and Brumelle and McGill [24].

Consider the problem of selling tickets on a single-leg airline flight. Suppose there are two types of customers: class two customers are the discount customers, who purchase tickets at the discount price f2, and class one customers are the regular customers, who purchase tickets at the regular price f1 > f2. (Assume that f1 and f2 are fixed.) All the discount customers arrive before any of the regular customers. Suppose the number of regular customers is a random variable X1 with known distribution F1.

Suppose there are k seats left unsold, and a discount customer arrives. Should we sell the ticket to him and get f2 in revenue, or reject him and hope to sell it to a future regular customer? If we reject the discount customer, then the probability that we sell the kth empty seat to a regular customer is P(X1 ≥ k). Hence the expected revenue is P(X1 ≥ k)f1. Hence we should reject the discount customer when there are k unsold seats if

P(X1 ≥ k) ≥ f2/f1.

This implies a protection level policy: protect θ1 seats for the regular customers, where θ1 is the largest number for which

P(X1 ≥ θ1) ≥ f2/f1.

Just as in the newsvendor models, it is common to assume that X1 is a continuous random variable, and define

θ1 = F1^{−1}((f1 − f2)/f1).

This θ1 is called the protection level for class 1 (regular) passengers. Define the Expected Marginal Seat Revenue from class 1 customers (EMSR1) as the expected increase in revenue if the number of seats reserved for them is increased by one. Thus we can interpret the above policy as follows: the protection level θ1 is chosen so that the EMSR1 equals the revenue from class two customers.

This simple model can be easily extended to multiple classes of customers with nested protection levels as follows. Suppose there are k + 1 classes with fares fk+1 < fk < · · · < f1. Customers of class k + 1 arrive first, then those of class k, and so on; finally the customers of class 1 arrive. Suppose there are protection levels θk > θk−1 > · · · > θ1 that determine the ticketing policy as follows: we sell a ticket to a class i + 1 customer if there are more than θi seats still available. Let Xi be the total demand from customers of classes 1 through i, assumed to be a continuous random variable with cdf Fi. Using the above argument about EMSR, we see that the thresholds θi are given by

P(Xi ≥ θi) = fi+1/f1 = ri+1, 1 ≤ i ≤ k. (4.1)

Such a revenue management policy is called a nested threshold policy.
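To make the computation concrete, the protection levels in Equation (4.1) can be read off the inverse cdfs. The sketch below uses hypothetical fares and normal demand distributions (both are assumptions for illustration, not data from the text):

```python
from statistics import NormalDist

# Hypothetical fares f1 > f2 > f3 > f4 (class 1 pays the most).
fares = [400.0, 250.0, 150.0, 90.0]

# Cumulative demand X_i of classes 1 through i, assumed Normal(mu_i, sigma_i).
demands = [NormalDist(40, 10), NormalDist(70, 15), NormalDist(110, 20)]

# Equation (4.1): P(X_i >= theta_i) = f_{i+1}/f_1 = r_{i+1}, so
# theta_i = F_i^{-1}(1 - f_{i+1}/f_1).
thetas = [X.inv_cdf(1.0 - fares[i + 1] / fares[0])
          for i, X in enumerate(demands)]

def sell_to_class(j, seats_available, thetas):
    """Nested threshold policy: sell to a class j customer (j >= 2) only
    if more than theta_{j-1} seats remain; class 1 is always served."""
    if j == 1:
        return seats_available > 0
    return seats_available > thetas[j - 2]
```

Note that the resulting protection levels come out nested (θ1 < θ2 < θ3), as the policy requires.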

Assignment: Present the results from Belobaba [11].

Assignment: Present the results from Brumelle and McGill [24].

There are many possible variations and extensions of this model. We refer the reader to the many review papers on this topic; see, for example, McGill and van Ryzin [82].

The natural question is: what should we do if the demand distributions are not known? We discuss one approach below.

4.1.2 Stochastic Approximation

The material in this section is based on van Ryzin and McGill [104]. They assume the k + 1 class model of the previous section. Let

θ = [θ1, θ2, · · · , θk]

be the vector of nested thresholds, and

X = [X1, X2, · · · , Xk]


be the vector of cumulative demands. Suppose the same flight departs every day and operates in a stationary environment. Assume data are available for the past n days. Let the threshold vector in place on day m be θ^m, and the demand vector be X^m. Suppose θ^m is known for m = 1, 2, · · · , n, as are the indicator events

A_i^m(θ^m, X^m) = 1{X_i^m ≥ θ_i^m}, 1 ≤ m ≤ n, 1 ≤ i ≤ k.

Define

Hi(θ, X) = ri+1 − Ai(θ, X),

and

H(θ, X) = [H1(θ, X), H2(θ, X), · · · , Hk(θ, X)].

Note that Hi(θ, X) = ri+1 > 0 if the protection level for the first i classes is not reached; otherwise it is ri+1 − 1 < 0. In the first case we should reduce the protection level θi, and in the second case we should increase it. This is precisely what the following Robbins-Monro stochastic approximation algorithm does:

θ^{n+1} = θ^n − γn H(θ^n, X^n),

where γn, n ≥ 1, is a sequence of positive scalars such that

∑_{n=1}^{∞} γn = ∞ and ∑_{n=1}^{∞} γn² < ∞.

This is a vector version of the stochastic approximation algorithm for the newsvendor problem presented in Section 1.5.2. Under appropriate conditions, this algorithm converges to a vector θ that satisfies

E(H(θ,X)) = 0,

which is the same as Equation (4.1).

There are two problems with this algorithm. First, the usual conditions for convergence do not hold. Second, θ^n may converge to a vector θ whose components are not monotonically increasing, and may even be negative! van Ryzin and McGill provide a proof of convergence (which is long and technical) and methods of circumventing the problems of negative values and non-monotonic protection levels.
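A minimal simulation sketch of the Robbins-Monro update (the fares, the simulated demand distributions, and the choice γn = 1/n are all our assumptions for illustration):

```python
import random

# Hypothetical two-threshold instance: fares f1 > f2 > f3.
fares = [400.0, 250.0, 150.0]
r = [fares[i + 1] / fares[0] for i in range(2)]   # r_{i+1} = f_{i+1}/f1

theta = [30.0, 60.0]                              # initial protection levels
random.seed(1)
for n in range(1, 5001):
    gamma = 1.0 / n        # satisfies sum(gamma) = inf, sum(gamma^2) < inf
    # Simulated cumulative demands X_1 and X_2 for day n.
    x1 = random.gauss(40, 10)
    X = [x1, x1 + max(random.gauss(30, 8), 0.0)]
    # H_i = r_{i+1} - 1{X_i >= theta_i}; step theta <- theta - gamma * H.
    H = [r[i] - (1.0 if X[i] >= theta[i] else 0.0) for i in range(2)]
    theta = [theta[i] - gamma * H[i] for i in range(2)]
```

At a fixed point E(H(θ, X)) = 0, i.e., P(Xi ≥ θi) = ri+1, which is Equation (4.1); with γn = 1/n the iterates move slowly, one reason a careful convergence analysis is needed.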

Assignment: Present the results of Sections 4 and 5 of van Ryzin and McGill [104].

4.2 Finite Inventory, Dynamic Pricing

In the airline revenue management problem we had assumed that the prices were fixed, and we optimized the booking policy. In the dynamic pricing model, we assume we can control pricing, but the booking policy plays no role.


4.2.1 A Simple Two-period Model

We present a very simple two-period model considered by Lazear [65]. We have one item and one customer. The customer's valuation of the product is a random variable V with known cdf F. The probability that the customer will buy the item at price p is P(V > p) = 1 − F(p). If we have only one chance to sell, we would choose a price p1 that maximizes p1(1 − F(p1)). If we have two chances, we set the price to be pi in period i: if the item does not sell at price p1 in period one, we offer it at price p2 in period two. Note that if the item does not sell at price p1 in period 1, it will sell at price p2 ≤ p1 in period 2 with probability

P(V > p2 | V ≤ p1) = (F(p1) − F(p2))/F(p1), 0 ≤ p2 ≤ p1.

Thus we find prices 0 ≤ p2 ≤ p1 that maximize

p1(1 − F(p1)) + p2 · [(F(p1) − F(p2))/F(p1)] · F(p1) = p1(1 − F(p1)) + p2(F(p1) − F(p2)).

For example, if V ∼ U(0, 1), so that F(p) = p for 0 ≤ p ≤ 1, the optimal one-period policy is p1 = 1/2, with expected revenue 1/4, while the optimal two-period policy is p1 = 2/3 and p2 = 1/3, with expected revenue 1/3!
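These numbers are easy to verify by brute force; a quick sketch (the grid step is chosen so that 1/3 and 2/3 are exactly representable):

```python
# Check the U(0,1) example: maximize p1(1-p1) + p2(p1-p2)
# over all prices 0 <= p2 <= p1 <= 1 on a grid.
def two_period_revenue(p1, p2):
    return p1 * (1 - p1) + p2 * (p1 - p2)

grid = [i / 300 for i in range(301)]
rev, p1, p2 = max((two_period_revenue(a, b), a, b)
                  for a in grid for b in grid if b <= a)
# Optimum: p1 = 2/3, p2 = 1/3, revenue = 1/3.
```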

This model assumes that the customer is a price taker. However, in practice customers can be strategic, in which case this becomes a problem of game theory. Lazear [65] studies such a game-theoretic model. We shall assume that customers are non-strategic in this chapter.

4.2.2 Discrete Time Model

A standard dynamic pricing model assumes that we have a finite initial inventory that we want to liquidate over a finite time horizon T. Let Xt be the inventory on hand at the beginning of time period t = 0, 1, 2, · · · , T − 1. At time t we decide on the price pt, which induces a random demand Dt (presumably a decreasing function of pt). We incur a holding cost hXt, earn a reward Dtpt, and the inventory reduces to

Xt+1 = (Xt −Dt)+.

The remaining inventory at time T is discarded at a cost of s per unit. The aim is to devise a pricing policy that maximizes

φ(x) = E( ∑_{t=0}^{T−1} [min(Dt, Xt)pt − hXt] − sXT | X0 = x ).

Suppose we know that

P(Dt ≤ y | pt = p) = F(y|p).

We can solve this problem by formulating an MDP, defining φk(x) as the maximum expected revenue if there are k more periods to go and the current inventory level is x. Then the backward dynamic programming equations become

φ0(x) = −sx,


φk(x) = max_p { −hx + Ep(p min(x, D) + φk−1((x − D)^+)) }, k = 1, 2, · · · , T.

Here Ep represents expectation with respect to F(·|p). This recursion is easy to evaluate numerically, especially if D is an integer-valued random variable. The optimal revenue is given by φT(C), where C is the initial inventory.

Example 4.1. Suppose, given the price p ∈ [0, 1], the demand is a Binomial(n, 1 − p) random variable. The initial inventory is N, and the time horizon is T. Then the above equations become

φ0(i) = −si, 0 ≤ i ≤ N,

φk(0) = 0, 1 ≤ k ≤ T,

φk(i) = max_p { −hi + ∑_{j=0}^{i} [pj + φk−1(i − j)] P(D = j) + p i P(D > i) }, 1 ≤ i ≤ N, 1 ≤ k ≤ T.

We use h = 0, s = 0, T = 5, N = 20, n = 10,

and numerically solve the DP equations. Figure 4.1 shows the five graphs of φk(i) vs. i for k = 1, 2, 3, 4, 5. The lowest curve corresponds to k = 1 and the topmost to k = 5. Note that φk(i) increases in k and i. The optimal price pk(i) to be used when the inventory is i and there are k more days to go is plotted in Figure 4.2. The lowest curve is for k = 1 and the topmost is for k = 5. The optimal price is a decreasing function of the inventory, and an increasing function of k, the number of days remaining. This makes intuitive sense.


Figure 4.1: Optimal revenue as a function of current inventory.

At the beginning, with 5 days to go and a starting inventory of 20, the optimal price is .63 and the optimal revenue is 11.064.



Figure 4.2: Optimal price as a function of the current inventory.

There is not much literature on the discrete time model. Bitran and Mondschein [20] present a discrete time non-stationary model in Section 3 of their paper. Federgruen and Heching [34] consider a discrete time model, but with the possibility of replenishing the inventory, and study the structural properties of the joint inventory-pricing policies. The structural results in these papers are similar to the ones observed in the numerical example presented above.

4.2.3 Continuous Time Model

Most of the dynamic pricing models in the literature are set in continuous time. Here we present the model studied by Gallego and van Ryzin [38]. Suppose the initial inventory is C and we want to clear it by time T. Customers arrive according to a PP(Λ). If the price is p at the time of arrival, the customer will buy the item with probability α(p), which is a decreasing function of p. Let λ(p) = Λα(p), and assume that its inverse function p(λ) exists. Thus, if we want to sell items at rate λ, we should set the price to be p(λ), and we can think of λ as the decision variable instead of p. The revenue rate of using sales rate λ is given by

r(λ) = λp(λ).

Let X(t) be the inventory at time t, and let φ(k, t) be the maximum expected revenue if the inventory is k and there are t more time units to go. Doing an infinitesimal analysis, under weak smoothness assumptions on r(·), we get the following Hamilton-Jacobi-Bellman (HJB) equation,

(d/dt) φ(k, t) = sup_{λ≥0} { r(λ) − λ[φ(k, t) − φ(k − 1, t)] }, (4.2)

with boundary conditions φ(0, t) = 0 for all t, and φ(k, 0) = 0 for all k.


Once we have φ(k, t) for all k and t, we can compute the optimal λ∗(k, t) as the value of λ that maximizes the right-hand side of the above equation. Once we have λ∗, we have the optimal price p∗. Gallego and van Ryzin [38] derive several structural properties of the value function φ and the optimal pricing policy p∗.

In general, Equation (4.2) has to be solved numerically. However, an analytical solution is possible for the case

λ(p) = Λe−p, p ≥ 0. (4.3)

Then r(λ) is maximized at λ∗ = Λ/e.

The solution is

φ(k, t) = log( ∑_{i=0}^{k} (λ∗t)^i / i! ),

and the optimal price is given by

p∗(k, t) = φ(k, t)− φ(k − 1, t) + 1.
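These closed-form expressions are straightforward to evaluate; a sketch with a hypothetical market size Λ = 10:

```python
from math import log, factorial, exp

# Closed-form solution for the exponential demand case λ(p) = Λ e^{-p}.
Lam = 10.0
lam_star = Lam / exp(1)          # λ* = Λ/e maximizes r(λ)

def phi(k, t):
    """Optimal expected revenue with k items and t time units to go."""
    return log(sum((lam_star * t) ** i / factorial(i) for i in range(k + 1)))

def p_star(k, t):
    """Optimal price: p*(k, t) = phi(k, t) - phi(k - 1, t) + 1."""
    return phi(k, t) - phi(k - 1, t) + 1
```

The boundary conditions φ(0, t) = 0 and φ(k, 0) = 0 hold automatically, and the price is decreasing in the inventory k and increasing in the remaining time t, as expected.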

The question is: what should we do when the demand distribution is unknown? We discuss several approaches below.

4.3 Dynamic Pricing with Unknown Demand Function

4.3.1 Discrete Time Model: Linear Regression

Bertsimas and Perakis [14] consider a discrete-time finite-horizon (T) model with a parametric demand function with unknown parameters. The idea is to use the data to estimate these parameters and then use the estimates to make the pricing decisions. This clearly creates a tradeoff between learning and earning. They assume that the demand Dt at time t depends on the price pt as follows:

Dt = a+ bpt + εt,

where εt, t ≥ 1, are iid N(0, σ2). Here a > 0, b < 0, and σ2 ≥ 0 are unknown parameters. Let Xt be the leftover inventory at time t, with X1 = c being the initial inventory. We have

Xt+1 = (Xt −Dt)+, 1 ≤ t ≤ T.

We need to determine pt based on (Dk, pk), 1 ≤ k ≤ t − 1, and Xt. They assume that p is constrained to lie in a fixed interval [pmin, pmax].

Consider the deterministic case where the parameters a and b are known and σ2 = 0. Write dt = a + bpt and Xt = xt; the problem then reduces to the following deterministic optimization problem:


Maximize ∑_{t=1}^{T} min(dt, xt) pt

such that dt = a + bpt, t = 1, 2, · · · , T,

xt+1 = (xt − dt)^+, 1 ≤ t < T,

pmin ≤ pt ≤ pmax, t = 1, 2, · · · , T.

They show that the optimal policy is given as follows. Let

pt = max{ −a/(2b), −[(T − t + 1)a − xt] / [(T − t + 1)b] }.

The optimal solution to the above optimization problem is then given by

p∗t = H(pt, pmin, pmax), t = 1, 2, · · · , T,

where H(x, a, b) (with a < b) is the truncation function: H(x, a, b) = a if x < a, x if a ≤ x ≤ b, and b if x > b. Note that

xt ≥ (T − t+ 1)a/2⇒ pt = −a/(2b)

which is the price that maximizes the revenue dtpt in the tth period (assuming it is feasible). Thus we follow a myopic policy in the beginning, until the inventory reduces to a small enough level; then we start charging a price higher than the myopic optimal price.

Assignment: Present a proof of this result from Bertsimas and Perakis [14].

Next consider the case where a, b, and σ2 are unknown, but we follow a myopic policy. That is, we first compute the estimates at and bt of a and b at time t, and then use the above deterministic policy with these estimates in place of a and b. The estimates at and bt are given by the linear regression coefficients:

bt = [(t − 1) ∑_{k=1}^{t−1} pk Dk − (∑_{k=1}^{t−1} pk)(∑_{k=1}^{t−1} Dk)] / [(t − 1) ∑_{k=1}^{t−1} pk² − (∑_{k=1}^{t−1} pk)²],

at = (1/(t − 1)) ∑_{k=1}^{t−1} Dk − bt (1/(t − 1)) ∑_{k=1}^{t−1} pk.

It is possible to develop an efficient recursive procedure to compute (at+1, bt+1) in terms of (at, bt).
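One such recursive procedure simply maintains the four running sums appearing in the formulas, so each period's update is O(1); a sketch (the class and variable names are ours):

```python
# Maintain running sums so the OLS estimates (a_t, b_t) update in O(1)
# per observed (price, demand) pair.
class OLSDemand:
    def __init__(self):
        self.t = 0
        self.sp = self.sd = self.spp = self.spd = 0.0

    def observe(self, p, d):
        """Record one period's price p and realized demand d."""
        self.t += 1
        self.sp += p
        self.sd += d
        self.spp += p * p
        self.spd += p * d

    def estimates(self):
        """Return (a, b) from the closed-form OLS formulas."""
        t = self.t
        den = t * self.spp - self.sp ** 2
        b = (t * self.spd - self.sp * self.sd) / den
        a = self.sd / t - b * self.sp / t
        return a, b
```

On noiseless linear data the estimates recover the true coefficients exactly, which is a convenient sanity check.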

The main drawback of this procedure is that it ignores the fact that the pricing decision in period t will affect the estimates in period t + 1. To address this drawback, the authors formulate a Markov decision process to find the optimal policy. Unfortunately this leads to a six-dimensional state vector

(Xt, ∑_{k=1}^{t−1} pk, ∑_{k=1}^{t−1} pk², ∑_{k=1}^{t−1} Dk, ∑_{k=1}^{t−1} Dk², ∑_{k=1}^{t−1} pk Dk).

So this approach is not easily implementable.


4.3.2 An n-period Model

Babaioff et al [8] consider an n-period problem with a finite initial inventory of k items. They assume that the allowed prices are bounded and form the discrete set P = {δ(1 + δ)^i ∈ [0, 1] : i = 0, 1, 2, · · · }. For each price p ∈ P and each time t ≤ n they define an index It(p), with the following interpretation: at each time t we use the price p with the highest index. Let St(p) be the observed sales rate over 0, 1, · · · , t − 1 while using price p, and let rt(p) be the half-width of the confidence interval for it. They define

It(p) = p · min{k, n(St(p) + rt(p))}.

They derive results bounding the regret of using this policy relative to any fixed-price policy.

4.3.3 Minimax Regret

Several researchers have explored pricing policies that minimize the maximum expected regret. We present some relevant work here.

We begin with the results of Besbes and Zeevi [18]. They consider a continuous time pricing model as described in Section 4.2.3. They assume that the arrival rate function λ(p) is unknown and has to be inferred from the observed data. The price p is constrained to lie in the interval [pmin, pmax]. In addition, we can also set the price high enough to make λ(p) = 0.

First, consider the deterministic case where the demands occur continuously at rate λ(p) when the price is p. Suppose the initial inventory is x and T is the length of the finite horizon. The optimal revenue is then given by

JD(x, T |λ) = sup ∫_0^T p(s) λ(p(s)) ds

such that ∫_0^T λ(p(s)) ds ≤ x, p(·) feasible.

They show that the optimal policy is given as follows. Let

pu = argmax_p p λ(p), pc = argmin_p |λ(p) − x/T|,

pD = max{pu, pc}, T′ = min{T, x/λ(pD)}.

Note that pc is the largest price at which we can sell all x units of inventory over the selling period T. Then the optimal policy is to use the price p(s) = pD until the inventory runs out or the end of the horizon is reached. It is interesting that the price p(s) is not a function of the inventory x(s). This is a consequence of the fact that this is a deterministic setting.

Now consider the case where λ(p) is the rate of an NPP. Let π be an admissible pricing policy, and let Jπ(x, T ; λ) be the expected revenue from policy π. Then we know that

Jπ(x, T ;λ) ≤ JD(x, T |λ).


Hence the regret of using π can be quantified as

Rπ(x, T ; λ) = 1 − Jπ(x, T ; λ) / JD(x, T |λ).

Note that the regret lies in [0, 1]. The worst-case regret under π is sup_λ Rπ(x, T ; λ), where the supremum is taken over the set of allowed λ's. Clearly we would like to pick a policy that minimizes this worst-case regret, that is, one that achieves inf_π sup_λ Rπ(x, T ; λ).

Besbes and Zeevi [18] study a policy π(κ, τ) that works as follows. There is a learning phase of length τ < T, divided into κ intervals of length ∆ = τ/κ. We use equally spaced prices over these intervals: in the ith interval we use the price pi = pmin + (i − 1)(pmax − pmin)/κ, 1 ≤ i ≤ κ. That is, for each 1 ≤ i ≤ κ, we apply price pi over the ith interval and observe the total demand Di. (What happens if we run out of inventory?) Compute

d(pi) = Di/∆.

This is just the maximum likelihood estimate of λ(pi). Then compute

pu = argmax{pi d(pi) : 1 ≤ i ≤ κ},

pc = argmin{|d(pi) − x/T| : 1 ≤ i ≤ κ},

and set p = max{pu, pc}. Then apply the price p over [τ, T] as long as the inventory is positive.

Besbes and Zeevi [18] prove asymptotic optimality by considering a sequence of problems (xn, T ; λn) indexed by n such that

xn = nx, λn(p) = nλ(p), n ≥ 1.

One can think of n as an indicator of market size. They assume that λ belongs to a class of bounded, Lipschitz continuous functions with feasible revenue rate at least m > 0. Then they show that if we choose

τn ≈ n^{−1/4}, κn ≈ n^{1/4},

then for policy πn = π(κn, τn), there exists a finite positive C such that

sup_λ R^{πn}(xn, T ; λn) ≤ C (log n)^{1/2} / n^{1/4}.

Thus the sequence πn of policies is asymptotically optimal.

Assignment: Present a proof of this result from Besbes and Zeevi [18].

The authors derive a similar result for a parametric class of λ's.

One reason they get a non-adaptive policy over [τ, T] is the fact that they are looking for minimax regret policies, which tend to be very conservative. This issue is addressed by Bayesian policies, as discussed in the next section.

Next we discuss the results of den Boer and Zwart [21]. Their model is more suited to internet sales, where we have the ability to observe each arrival (a site visit by a user) and whether the visitor purchased an


item. Suppose the initial inventory is C and we can serve at most N visitors, after which the site closes down and all the unsold inventory is discarded. (We can think of one customer arriving per unit time, with a selling season lasting N time units.) Let Yn = 1 if the nth visitor purchased an item, and zero otherwise. Let pn be the price offered to the nth visitor. Let

d(p) = P(Yn = 1|p)

be the probability of a successful sale if price p is offered. We assume a parametric logit model as given below:

d(p|θ) = P(Yn = 1 | p, θ) = exp(−θ0 − θ1p) / (1 + exp(−θ0 − θ1p)),

where θ = [θ0, θ1] is a pair of parameters. The expected revenue from a customer is pd(p|θ). If θ were completely known and we had unlimited inventory, we would choose the optimal price

p∗(θ) = argmax_p p d(p|θ).

If θ is known and the inventory is finite, the optimal price will also depend on the current inventory on hand. Let v(c, n) be the optimal total expected revenue if the current inventory is c and n visitors have already used the site. Then the optimality equation becomes:

v(c, n) = max_p { p d(p|θ) + (1 − d(p|θ)) v(c, n + 1) + d(p|θ) v(c − 1, n + 1) },

with boundary condition v(c, N) = 0. The optimal price in state (c, n) is the p that maximizes the right-hand side of the DP equation.
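For known θ, this optimality equation is easy to solve by backward recursion on n; a sketch with hypothetical parameters and a discrete price grid:

```python
from math import exp

# Hypothetical logit parameters, inventory, and season length.
theta0, theta1 = -1.0, 0.5
C, N = 5, 50
prices = [p / 10 for p in range(1, 101)]          # grid on (0, 10]

def d(p):
    """Purchase probability under the logit model d(p | theta)."""
    e = exp(-theta0 - theta1 * p)
    return e / (1 + e)

# v[c][n]: optimal expected revenue with inventory c after n visitors;
# boundary conditions v[c][N] = 0 and v[0][n] = 0 hold by initialization.
v = [[0.0] * (N + 1) for _ in range(C + 1)]
for n in range(N - 1, -1, -1):
    for c in range(1, C + 1):
        v[c][n] = max(d(p) * (p + v[c - 1][n + 1])
                      + (1 - d(p)) * v[c][n + 1]
                      for p in prices)
```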

Now consider the case where θ is unknown and the inventory is finite. (The case of unknown θ and infinite inventory is discussed in Section 4.4.3.) Suppose we know the data (Yi, pi), 1 ≤ i ≤ n, for a given n. Let L(θ) be the likelihood of these data. Then we can estimate θ by maximizing L(θ). Let θn be the MLE of θ based on observing the first n customers. den Boer and Zwart [21] derive results on how quickly θn converges to the real θ. Consider the myopic policy that uses the price p∗(θn) for the (n + 1)st arrival in the season. The authors study a minor variation of this policy: the price is p∗(θn) except when c = 1 or n = N; see their paper for more details. They show that the regret R(T) of following this policy for the first T selling periods is O(log²(T)). Recall that the regret is how much the total expected revenue under a policy falls short of the optimal total expected revenue when the parameters are known.

Assignment: Present the salient results of den Boer and Zwart [21].

Actually, if there are multiple selling seasons, it would also be of interest to decide on the optimal inventory C at the beginning of each season.

4.3.4 Discrete Time Model: Bayesian Learning

We illustrate Bayesian learning using the customer arrival process NPP(λ(p)), as defined in Equation (4.3), in the discrete time model of Section 4.2.2. Suppose the only unknown is the arrival rate Λ of potential


customers. The functional form of Equation (4.3) is assumed to hold. Thus, if Λ is known, the number of potential sales in a period with price p is a P(Λe^{−p}) random variable. Suppose the prior distribution of Λ is Gamma(θ, m), where θ is the rate parameter and m is the shape parameter. The prior density is

f(λ|θ, m) = θ e^{−θλ} (θλ)^{m−1} / (m − 1)!.

The probability of observing n sales during that period is given by

α(n|θ, m) = [(n + m − 1)! / (n!(m − 1)!)] (e^{−p}/(θ + e^{−p}))^n (θ/(θ + e^{−p}))^m.

This is a negative binomial pmf with parameters m and θ. Also define

β(n|θ, m) = ∑_{j=n}^{∞} α(j|θ, m).

Suppose the observed number of sales in that period is n. Then it is known that the posterior distribution of Λ is Gamma(θ + e^{−p}, m + n). (See Aviv and Pazgal [6].)

We can now formulate an MDP by defining φk(i|θ, m) as the maximum expected revenue if there are k more periods to go, the current prior for Λ is Gamma(θ, m), and the current inventory level is i. Then the backward dynamic programming equations become

φ0(i|θ,m) = −si,

φk(i|θ, m) = max_p { −hi + ∑_{j=0}^{i−1} α(j|θ, m)[pj + φk−1(i − j | θ + e^{−p}, m + j)] + p i β(i|θ, m) }, k = 1, 2, · · · , T.

This recursion has to be evaluated numerically. The optimal price in state (i|θ, m) when there are k periods to go is the value of p that maximizes the right-hand side. We then use this optimal price by keeping track of the current inventory i and updating the parameters θ and m as newer sales figures are observed.
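The Bayesian bookkeeping inside this recursion is just the conjugate Gamma update described above; a sketch (the prior values and the observation sequence are hypothetical):

```python
from math import exp

def update(theta, m, p, n_sales):
    """Gamma-Poisson conjugate update: with prior Gamma(rate=theta,
    shape=m) on Λ and n_sales observed at price p, the posterior
    is Gamma(theta + e^{-p}, m + n_sales)."""
    return theta + exp(-p), m + n_sales

theta, m = 1.0, 2.0          # hypothetical prior
for p, n in [(0.5, 3), (0.8, 1), (1.0, 0)]:
    theta, m = update(theta, m, p, n)
# The posterior mean of Λ is m / theta.
```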

Aviv and Pazgal [6] report a similar Bayesian analysis of this model in continuous time. We briefly present their results below. Let φi,m(t, θ) be the optimal revenue if the sale ends after t more days and the current state is (i, θ, m). The HJB equation now becomes

∂φi,m(t, θ)/∂t = sup_p { (m/θ) e^{−p} [p + φi−1,m+1(t, θ) − φi,m(t, θ)] + e^{−p} ∂φi,m(t, θ)/∂θ }, i ≥ 1,

with the boundary condition φ0,m(t, θ) = 0.

They derive several properties of the optimal value function and the optimal pricing rule.

Assignment: Present the salient results from Aviv and Pazgal [6].

Lin [70] also considers a similar Bayesian learning model, and derives simpler heuristic pricing policies.

Assignment: Present the salient results from Lin [70].


Other authors have analyzed alternative Bayesian models. For example, Araman and Caldentey [2] assume that the sales form an NPP with price-dependent rate given by θλ(p). The function λ(p) is completely known; the only uncertainty is about the scale parameter θ. They assume that θ takes two possible values θH > θL. The value θH implies a big market, while θL implies a small market. The state of knowledge is given by the current probability q = P(θ = θH). They derive an HJB equation for v(n, q), the optimal revenue if the initial inventory is n and the initial belief is q. The state variable q is updated according to Bayes' rule as time passes and sales occur. Also see Harrison et al [51].

Assignment: Present the salient results from Araman and Caldentey [2].

Assignment: Present the salient results from Harrison et al [51].

Subrahmanyan and Shoemaker [102] study joint pricing and inventory control with Bayesian learning. As a special case, in Section IV, they study the discrete time model where the inventory can be ordered only at the beginning of a finite selling season.

Assignment: Present the results from Section IV of Subrahmanyan and Shoemaker [102].

4.4 Unlimited Inventory, Dynamic Pricing.

So far we have assumed that there is a finite initial inventory and no reordering is possible. Now we shall consider several models of optimal pricing where the inventory plays no role.

4.4.1 Linear Regression

We begin with an infinite horizon, discrete time model considered by Le Guen [67]. He assumes a parametric form for the demand as a function of price, and uses statistical methods to update the parameters as more data become available. There is always enough inventory on hand to satisfy all the demand, so the inventory plays no role in this analysis.

Suppose the demand curve in a period with fixed price p is given by

d(p) = a− bp+ ε,

where ε is a mean-zero random variable with unknown distribution. We assume that p ∈ [pmin, pmax]. If the demand parameters a and b were known, the optimal price p∗, which maximizes pE(d(p)), would be given by

p∗ = H(a/(2b), pmin, pmax),

where H(x, a, b) (with a < b) is the truncation function: H(x, a, b) = a if x < a, x if a ≤ x ≤ b, and b if x > b.

Now suppose a and b are unknown. At time n we set the price pn, which results in sales Dn = a − bpn + εn, where εn, n ≥ 1, have mean zero. We use the prices pk, 1 ≤ k ≤ n, and the corresponding sales


Dk, 1 ≤ k ≤ n, in a linear regression model to compute the estimates an and bn of a and b, and then choose pn+1 = H(an/(2bn), pmin, pmax) as above. The estimates are given by

bn = [n ∑_{k=1}^{n} pk Dk − (∑_{k=1}^{n} pk)(∑_{k=1}^{n} Dk)] / [n ∑_{k=1}^{n} pk² − (∑_{k=1}^{n} pk)²],

an = (1/n) ∑_{k=1}^{n} Dk − bn (1/n) ∑_{k=1}^{n} pk.

One can develop an efficient recursive procedure to compute (an+1, bn+1) in terms of (an, bn). One can show that, almost surely,

an → a, bn → b, pn → p∗,

and

(1/n) ∑_{k=1}^{n} pk Dk → the optimal revenue.

Assignment: Prove the above convergence using Chapter 2 of Le Guen [67].

Notice that the policy constructed above is a myopic policy: we pick a price at time n to maximize the expected revenue in period n. However, that choice affects the estimates of the demand function parameters, and hence the future decisions. Thus myopic optimal pricing may not maximize the total expected revenue over a finite horizon, since it may be better to use lower-than-myopic prices to increase the demand in order to get a better handle on the estimates of a and b. This is the classic “earning” versus “learning” tradeoff inherent in systems with unknown parameters.

4.4.2 Bayesian Learning

The material in this section is based on Carvalho and Puterman [27]. They consider a discrete time finite-horizon problem. They assume that the demand at time n, with price pn, is given by

d(pn) = exp(a + bpn + εn),

where a and b are unknown and εn, n ≥ 1, are iid N(0, σ2) random variables with known σ2. We assume b < 0.

If a and b are known, the optimal price in any period is −1/b, and the optimal expected revenue over T periods is given by

R∗(T) = −(T/b) e^{a−1+σ²/2}.

When a and b are not known, this gives an upper bound on the expected revenue. Carvalho and Puterman develop a Bayesian model for this case. Suppose the initial prior distribution of (a, b) is a bivariate normal with mean θ0 and covariance matrix σ2P0. Then one can show that the posterior distribution after observing the demand data up to time n is also a bivariate normal, with mean θn and covariance matrix σ2Pn. This is the


prior for making pricing decisions in period n + 1. Let pn be the price chosen and Dn the demand observed in period n, and define

yn = log(Dn), zn = [1 pn]′, Fn = z′n Pn−1 zn + 1.

Then we have the following recursion to update θ and P :

θn = θn−1 + Pn−1 zn Fn^{−1} [yn − z′n θn−1], n ≥ 1,

Pn = Pn−1 − Pn−1 zn Fn^{−1} z′n Pn−1, n ≥ 1.

The prior estimates of a, b at time n are an−1 = θn−1,1 and bn−1 = θn−1,2.
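The θ, P recursion above is the standard recursive least squares update; a dependency-free sketch treating zn as the column vector [1, pn]′ (variable names are ours):

```python
# One step of the recursive least squares / Bayesian posterior update.
def rls_update(theta, P, p, y):
    """Observe price p and y = log(demand); return updated (theta, P)."""
    z = [1.0, p]                                   # z_n = [1, p_n]'
    Pz = [P[0][0] * z[0] + P[0][1] * z[1],         # P_{n-1} z_n
          P[1][0] * z[0] + P[1][1] * z[1]]
    F = z[0] * Pz[0] + z[1] * Pz[1] + 1.0          # F_n = z' P z + 1
    innov = y - (theta[0] * z[0] + theta[1] * z[1])
    theta = [theta[i] + Pz[i] * innov / F for i in range(2)]
    P = [[P[i][j] - Pz[i] * Pz[j] / F for j in range(2)] for i in range(2)]
    return theta, P
```

Starting from a diffuse prior (large P0), the estimates track the least-squares fit of y on [1, p].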

Let us first consider a two-period problem. The prior for (a, b) at time 2 is based on the observed sales at time 1, and is given by BN(θ1, σ2P1). The optimal price at time 2 is clearly p∗2 = −1/b1 = −1/θ1,2, since no more learning is involved. The expected revenue in period 2 is

R∗2(b1) = E( −(1/b1) e^{a − b/b1 + ε2} ) = −(1/b1) e^{a − b/b1 + σ²/2}.

This is maximized by choosing b1 = b, which we do not know. We know that b1 is N(θ1,2, σ²_{b1}), where σ²_{b1} = σ²P1,(2,2). Hence, using a Taylor expansion and computing the expectation with respect to b1, we get

E(R∗2) = R∗2(b) + (1/2) (e^{a−1+σ²/2} / b³) σ²_{b1}.

Thus we should choose the price p1 to maximize

p1 exp(a + bp1 + σ²/2) + (1/2) (e^{a−1+σ²/2} / b³) σ²_{b1}(p1).

Note that σ²_{b1} is also a function of p1. We use the initial estimates of a and b to compute the above p1. Clearly, the more accurate the estimates (that is, the lower σ2), the closer the total expected revenue is to the true optimum. This illustrates the tradeoff between learning and earning: if we were not interested in learning, the second term would be absent from the optimization.

It is intractable to extend this analysis to more than two periods. However, one can use it as a heuristic, called the two-step look-ahead policy. Under this policy, at time n we choose a pn that maximizes

pn exp(a_{n−1} + b_{n−1}pn + σ²/2) + (G(n)/2) (e^{a_{n−1}−1+σ²/2} / b_{n−1}³) σ²_{bn}(pn).

The authors include the multiplicative term G(n), a decreasing function of n with G(T) = 0. This indicates that we should pay more attention to learning in the early stages (small n), and reduce it as we learn more. They evaluate the performance of this policy numerically and compare it to the optimal policy.

Assignment: Present the results in Section 4 of Carvalho and Puterman [27].

Carvalho and Puterman [28] perform a similar analysis of an internet-based platform where it is possible to observe how many users visited the platform and how many of them actually bought the item. They use a Binomial demand model and develop heuristic pricing policies that achieve the tradeoff between learning and earning.

Assignment: Present the salient results from Carvalho and Puterman [28].


4.4.3 Minimax Regret

Here we present a simplified version of the results of Broder and Rusmevichientong [23]. The setting is the same as in den Boer and Zwart [21], as discussed in Section 4.3.3, except that there is unlimited inventory and an infinite horizon (C = N = ∞). They start with the policy of offering the price p∗(θn) to the (n + 1)st customer.

The authors propose a variant of this simple policy so as to balance the learning and earning phases. They call it the MLE-CYCLE policy. It has a given set of exploratory prices p = [p1, p2, · · · , pk], and it works in cycles as follows. The cth cycle consists of a learning phase of k customers followed by an earning phase of c customers. In the cth cycle, we offer the prices p1, · · · , pk to the first k customers and observe their responses Y1(c), · · · , Yk(c). Then we compute the MLE of θ based on the responses from all the customers in the learning phases of this and all the previous cycles, and use the optimal price p∗(θ) for the next c customers in the earning phase of the cth cycle. The exploratory prices do not change from cycle to cycle.
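The cycle structure itself is simple to pin down; a sketch that maps a customer index to its phase (the function name is ours):

```python
# MLE-CYCLE schedule: cycle c = k exploration customers followed by
# c exploitation customers (the prices themselves are omitted here).
def phase_of_customer(t, k):
    """Return ('learn', c) or ('earn', c) for customer index t >= 0."""
    c, start = 1, 0
    while True:
        if t < start + k:
            return ("learn", c)
        if t < start + k + c:
            return ("earn", c)
        start += k + c
        c += 1
```

Since cycle c earns for c customers, the fraction of customers spent learning shrinks over time, which is how the policy balances learning against earning.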

Broder and Rusmevichientong [23] present the rate at which the regret from this policy decreases as thenumber of customers goes to infinity.
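The cycle structure above can be sketched in a few lines. The Bernoulli purchase model, the grid-based MLE, and all numerical values below are illustrative stand-ins, not the actual setup of Broder and Rusmevichientong [23]:

```python
import math
import random

# Hypothetical demand model: customer buys at price p w.p. 1/(1+exp(theta*p)).
def buy_prob(theta, p):
    return 1.0 / (1.0 + math.exp(theta * p))

def mle_grid(prices, responses, grid):
    """Maximize the Bernoulli log-likelihood over a grid of theta values."""
    def loglik(theta):
        ll = 0.0
        for p, y in zip(prices, responses):
            q = buy_prob(theta, p)
            ll += math.log(q) if y else math.log(1.0 - q)
        return ll
    return max(grid, key=loglik)

def optimal_price(theta, grid):
    """Price maximizing expected revenue p * P(buy; theta)."""
    return max(grid, key=lambda p: p * buy_prob(theta, p))

def mle_cycle(theta_true, expl_prices, n_cycles, seed=0):
    rng = random.Random(seed)
    hist_p, hist_y, revenue = [], [], 0.0
    theta_grid = [0.1 * i for i in range(1, 31)]
    price_grid = [0.1 * i for i in range(1, 51)]
    for c in range(1, n_cycles + 1):
        # learning phase: k exploratory customers, same prices every cycle
        for p in expl_prices:
            y = rng.random() < buy_prob(theta_true, p)
            hist_p.append(p); hist_y.append(y); revenue += p * y
        # earning phase: c customers at the plug-in optimal price,
        # using the MLE from all learning-phase responses so far
        theta_hat = mle_grid(hist_p, hist_y, theta_grid)
        p_star = optimal_price(theta_hat, price_grid)
        for _ in range(c):
            y = rng.random() < buy_prob(theta_true, p_star)
            revenue += p_star * y
    return theta_hat, revenue

theta_hat, rev = mle_cycle(1.0, [0.5, 2.0], n_cycles=10)
```

Note that the earning-phase responses are deliberately excluded from the MLE, matching the description above.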

4.4.4 Neural Network Models

Three types of parametric demand functions are commonly used in practice: linear, exponential, and logit. Shakya et al [98] study a neural network based demand function. We describe it briefly below. Suppose at the beginning of each week we are supposed to set and publish daily prices for that week, say [p1, p2, · · · , p7]. We have available data for the past T weeks, giving the price vector and the daily demand vector for each week. We use this data to fit a demand model of the type

dt = dt(p1, p2, · · · , p7), 1 ≤ t ≤ 7,

where dt is the demand on the tth day of the week. This assumes that the demand on a day depends on all seven prices in the week that the day belongs to. Thus the linear model is given by

dt = at + ∑_{j=1}^{7} bjt pj,

the exponential model is given by

dt = exp(at + ∑_{j=1}^{7} bjt pj),

and the logit model is given by

dt = B exp(−bt pt) / (1 + ∑_{j=1}^{7} exp(−bj pj)).

The neural network model uses seven different networks, one for each day of the week. Each network has three layers: the input layer has eight nodes, one for each price plus one bias node; the hidden layer has four nodes, one of which is a bias node; and the output layer has a single node. The output of the network for day t represents dt, the demand for day t, given the seven prices as input. The transfer function is the sigmoid function. We train the networks using the T weeks worth of data.
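A minimal sketch of the per-day network just described (seven price inputs plus a bias node, three hidden units plus a hidden bias node, one output, sigmoid transfer). The weights are random and the training loop is omitted, so this only illustrates the architecture, not the fitted model of Shakya et al [98]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DayDemandNet:
    """One network per weekday: 8 input nodes (7 prices + bias),
    4 hidden nodes (3 units + bias), 1 output node."""
    def __init__(self, rng):
        self.W1 = rng.normal(scale=0.1, size=(8, 3))   # input(+bias) -> hidden
        self.W2 = rng.normal(scale=0.1, size=(4, 1))   # hidden(+bias) -> output

    def predict(self, prices):
        x = np.append(prices, 1.0)          # 7 prices + bias node
        h = sigmoid(x @ self.W1)            # 3 hidden activations
        h = np.append(h, 1.0)               # + hidden bias node
        return float(sigmoid(h @ self.W2))  # demand on a (0,1) scale

rng = np.random.default_rng(0)
nets = [DayDemandNet(rng) for _ in range(7)]   # one network per day of the week
prices = np.array([9.5, 9.5, 8.0, 8.0, 10.0, 12.0, 12.0])
demands = [net.predict(prices) for net in nets]
```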

Once we have a prediction formula for dt, we need to solve the following optimization problem:

Maximize ∑_{t=1}^{7} dt pt.

For the linear model this reduces to a quadratic program, and for the exponential and logit models it is a non-linear program. For the neural network model, the authors use an evolutionary algorithm to solve the optimization problem, since it is best suited for black-box functions. They perform an extensive numerical exploration which shows that if the model is correctly identified, then estimation using that model, followed by the optimization algorithm using that model, works best. But when the model cannot be identified correctly, the neural network model outperforms on average. They do not report results based on real data.
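A toy (1+1)-style evolutionary search illustrates why such methods suit black-box demand functions; the mutation scheme and the linear demo demand below are illustrative choices, not the algorithm of Shakya et al [98]:

```python
import random

def revenue(prices, demand_fn):
    """Weekly revenue sum_t d_t(p) * p_t, with demand_fn treated as a black box."""
    return sum(demand_fn(prices, t) * prices[t] for t in range(7))

def evolve_prices(demand_fn, lo=1.0, hi=20.0, iters=2000, seed=0):
    rng = random.Random(seed)
    best = [rng.uniform(lo, hi) for _ in range(7)]     # random initial price vector
    best_rev = revenue(best, demand_fn)
    for _ in range(iters):
        # mutate every coordinate, clipped to the price range
        cand = [min(hi, max(lo, p + rng.gauss(0, 0.5))) for p in best]
        cand_rev = revenue(cand, demand_fn)
        if cand_rev > best_rev:                        # keep mutant only if it improves
            best, best_rev = cand, cand_rev
    return best, best_rev

# Example black-box demand: linear in own price only (assumed for the demo).
demo = lambda p, t: max(0.0, 100.0 - 4.0 * p[t])
prices, rev = evolve_prices(demo)
```

Because only `revenue` evaluations are used, the same search runs unchanged with the neural network demand model plugged in.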

Assignment: Present the salient results from Shakya et al [98].

Although the authors don’t mention it, this method can also be used to account for covariate data. For example, the demand on day t may depend on the price on day t, along with other covariates, such as weather, prices of competing brands, etc. Liu and Wang [72] have developed precisely this type of model for internet sales of books. The demand depends on the price, the rating received by the book, and the price of the book on competing websites. They construct a neural network with an input layer with three nodes (plus a bias node), a hidden layer with five nodes (plus a bias node), and an output layer with a single node. The output represents the sales during one period. They develop an algorithm for tuning the network that can efficiently deal with sequential data.

4.5 Robust Optimization

Many authors have used robust optimization to handle parameter uncertainty; see den Boer [22] for a good survey. We do not cover this area here since the policies produced are very conservative.


Chapter 5

Data Driven Medical Decisions

A typical medical decision making problem involves diagnosing a patient. That is, given the patient’s personal information and current symptoms, the doctor needs to decide which possible disease the patient may have. Once one is reasonably sure of the diagnosis, one proceeds to the treatment stage, where one has to settle on the “best” treatment for the patient.

The models that help in this situation are fundamentally different from the ones we have encountered so far. To understand this distinction, let S be the set of possible patient states (history, symptoms, etc.), and D be the set of possible diagnoses (diseases). It is often the case that the function that maps S into D is not known precisely. It may not even exist. If it did, one could simply define Si to be the set of symptoms that map onto diagnosis i, for i ∈ D. Then the problem would reduce to collecting enough information about the state of the patient so that it lies in exactly one Si. However, even if such a function exists, the patient state is almost never fully known. Getting full information can be impossible, or costly, or harmful to the patient. So we need to create an approximation of such a function. In all the models we studied so far, such a function was built using basic building blocks such as demand functions, so the problem of approximating the function reduced to the problem of estimating the basic building blocks. But in medical decision making, we do not have such a simple structure. So we need to think differently. We need to build what are known as statistical models. We discuss several common models below.

5.1 Statistical Models

A very readable account of the topics in this section can be found in James et al [55] and Bertsimas et al [13].

5.1.1 Linear Regression

Suppose the elements of S are vectors, say x = [x1, x2, · · · , xp] ∈ Rp, and the set D is the real line R. The linear regression model assumes that the mapping is linear, with an additive error term. That is, F : S → D is


assumed to be

y = F (x) = β0 + β1x1 + β2x2 + · · · + βpxp + ε,

where ε is the unknown error term, and the coefficients βi, 0 ≤ i ≤ p, are unknown. We have n observations (yi, xi = [xi1, xi2, · · · , xip]), i = 1, 2, · · · , n, of this function. We write these in matrix form as

y = Xβ + ε,

where the first column of X is all ones, and ε = [ε1, · · · , εn]′ is the vector of error terms. We estimate the coefficients β by minimizing the sum of squares

∑_{i=1}^{n} (yi − fi)²,

where

fi = β0 + β1xi1 + β2xi2 + · · · + βpxip.

This yields the OLS (ordinary least squares) estimator

β̂ = (X⊤X)⁻¹X⊤y.

The goodness of fit of such a model is described by the corresponding R² value, defined by

R² = 1 − ∑i(yi − fi)² / ∑i(yi − ȳ)².

The closer it is to 1, the better the fit.
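The OLS estimator and R² above can be computed directly; a sketch with simulated data (in practice np.linalg.lstsq is the numerically preferred route):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column all ones
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

fitted = X @ beta_hat
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```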

5.1.2 Logistic Regression

Let S be as in the previous subsection, but suppose D has only two elements: 0 and 1 (say, ulcers or no ulcers). Since we do not have a deterministic function that maps patient states onto 0 and 1, we try to impute the probability that the actual function value is 1. The logistic regression model assumes that this probability is given by the following logit function:

P(Y = 1|x) = f(β, x) = 1 / (1 + exp(β0 + β1x1 + β2x2 + · · · + βpxp)).

We have n observations (yi, xi = [xi1, xi2, · · · , xip]), i = 1, 2, · · · , n, with yi = 0 or 1. The likelihood, assuming independent observations, is given by

L(β) = ∏_{i=1}^{n} f(β, xi)^{yi} (1 − f(β, xi))^{1−yi}.

One can obtain the MLE β̂ by maximizing the above likelihood.

This does not completely solve the problem. For a given patient symptom vector x, this gives the probability p(x) = f(β̂, x) that the patient has the disease. This does not tell us whether the patient has the disease or not. A simple way to accomplish this is to use a threshold c as follows:

p(x) ≥ c ⇒ Y = 1, p(x) < c ⇒ Y = 0.
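A sketch of fitting the model by gradient ascent on the log-likelihood above, followed by the threshold rule. It uses the text's parametrization P(Y=1|x) = 1/(1+exp(β·x)); the data-generating parameters are illustrative, and a production analysis would use a statistics package instead:

```python
import math
import random

def f(beta, x):
    # text's convention: P(Y=1|x) = 1/(1+exp(beta . x)), with x[0] = 1 for the intercept
    return 1.0 / (1.0 + math.exp(sum(b * xi for b, xi in zip(beta, x))))

def fit(data, steps=2000, lr=0.1):
    """Gradient ascent on the log-likelihood; d loglik / d beta_j = -sum (y - f) x_j."""
    p = len(data[0][1])
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for y, x in data:
            err = y - f(beta, x)
            for j in range(p):
                grad[j] -= err * x[j]
        beta = [b + lr * g / len(data) for b, g in zip(beta, grad)]
    return beta

def classify(beta, x, c=0.5):
    return 1 if f(beta, x) >= c else 0    # threshold rule: p(x) >= c => Y = 1

rng = random.Random(2)
true_beta = [1.0, -2.0]                    # [intercept, slope], hypothetical
data = []
for _ in range(400):
    x = [1.0, rng.uniform(-2, 2)]
    data.append((1 if rng.random() < f(true_beta, x) else 0, x))
beta_hat = fit(data)
```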


How do we know how good this rule is? One method is to construct the Receiver Operating Characteristic (ROC) curve, as explained below.

Note that the predicted Y can be zero or one, and the real Y can be zero or one. This gives rise to the four cases given in Table 5.1.

Real Y ↓ \ Predicted Y →      0                1
0                             True Negative    False Positive
1                             False Negative   True Positive

Table 5.1: Classification Errors

As the threshold c increases from 0 to 1, the False Positive rate (the conditional probability that the predicted state is 1 given the real state is 0) and the True Positive rate (the conditional probability that the predicted state is 1 given the real state is 1) both decrease from 1 to 0. We want the True Positive rate to be close to one, and the False Positive rate to be close to zero. The graph of the True Positive rate (on the y-axis) vs. the False Positive rate (on the x-axis) as c varies from zero to one is called the ROC (Receiver Operating Characteristic) curve. The area under this curve (AUC) represents the probability that p(x) for an x with true diagnosis 1 is greater than p(x) for an x with true diagnosis 0. One can take the AUC as an indicator of how good the logistic regression is at discriminating between the true states. The true positive rate plus the true negative rate is the correct classification rate, and that should be as high as possible.
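The ROC construction and the AUC interpretation above can be computed from scratch as follows (a sketch assuming the scores lie in [0, 1]; sklearn.metrics provides the standard implementation):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the threshold c sweeps through the observed scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for c in sorted(set(scores)) + [1.1]:          # thresholds, plus one above all scores
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(scores, labels):
    """P(score of a random positive > score of a random negative), ties counted 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2]
labels = [0,   0,   1,    1,   1,    0]
pts = roc_points(scores, labels)
area = auc(scores, labels)
```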

An example of an ROC curve is shown in Figure 5.1.


Figure 5.1: Example of an ROC Curve.


5.1.3 Regression Trees

A regression tree is a way of making regression more understandable from the user’s point of view. Suppose we have the data S = {(yi, xi = [xi1, xi2, · · · , xip]), i = 1, 2, · · · , n} as before. We first divide the data set into two parts S1(j, c) = {(y, x) ∈ S : Xj ≤ c} and S2(j, c) = {(y, x) ∈ S : Xj > c}. Let ȳ1(j, c) be the average of yi over the data in S1(j, c) and ȳ2(j, c) be the average of yi over the data in S2(j, c). Then we find a j and a c that minimize

SSQ(j, c) = ∑_{(yi,xi)∈S1(j,c)} (yi − ȳ1(j, c))² + ∑_{(yi,xi)∈S2(j,c)} (yi − ȳ2(j, c))².

We use this (j, c) to split the data set into two subsets. We repeat this procedure for each of the two subsets. This continues until the resulting subsets are too small, or some other criterion is reached. For each of the final subsets, called the leaf nodes of the tree, we use the average over that set as the prediction for that subset. If such a tree is too large, one prunes it by aggregating some branches in a judicious fashion. An example tree is shown in Figure 5.2.

X1 ≤ 4
├─ Yes: X2 ≤ 3
│       ├─ Yes: y = 4.5   (leaf)
│       └─ No:  y = 8     (leaf)
└─ No:  X2 ≤ 6
        ├─ Yes: y = 11    (leaf)
        └─ No:  y = 16    (leaf)

Figure 5.2: Example of a Regression Tree.
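The split search that starts the growing of such a tree can be sketched as follows; the tiny data set is purely illustrative:

```python
def sse(ys):
    """Sum of squared deviations from the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(data):
    """Greedy search for the (j, c) minimizing SSQ(j, c).
    data: list of (y, x) pairs with x a list of p features."""
    p = len(data[0][1])
    best = None
    for j in range(p):
        for c in sorted({x[j] for _, x in data}):     # candidate cut points
            left = [y for y, x in data if x[j] <= c]
            right = [y for y, x in data if x[j] > c]
            ssq = sse(left) + sse(right)
            if best is None or ssq < best[0]:
                best = (ssq, j, c)
    return best

# Toy data: y jumps sharply as the first feature crosses 3.
data = [(4.5, [2, 1]), (8, [3, 5]), (11, [6, 4]), (16, [7, 8])]
ssq, j, c = best_split(data)
```

The full tree is grown by recursing on the two resulting subsets, as described above.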

5.1.4 Classification Trees

A classification tree is similar to the regression tree, except that the dependent variable y is categorical. In this case, we do the splitting into two subsets S1(j, c) and S2(j, c) as before. Let ŷ1(j, c) be the most commonly occurring y-value in the data in S1(j, c) and ŷ2(j, c) be the most commonly occurring y-value in the data in S2(j, c). Define the correct classification rate crk(j, c) as the fraction of the y values in Sk(j, c) that equal ŷk(j, c). Then we choose the j and c that maximize the correct classification rate

CCR = cr1(j, c) + cr2(j, c).


Using this j and c we partition the data into two subsets. We apply this procedure next to the two subsets, and proceed until a stopping criterion is reached. In each of the leaf nodes, we use the most commonly occurring value of y as the prediction for that node. Reverse pruning is used to reduce the size of the tree.

5.1.5 Random Forests

Random forests are a clever way of applying bootstrapping to the tree methods. Here we generate M trees using the same data by randomizing as follows: at each split in the tree, we use a random subset of size m from the set of p predictors. Typically m ≪ p, with m ≈ √p being a reasonable choice. Once we have M trees, we use them to come up with M different predictions as before, and then choose their average (in case y is a real number) or the mode (in case y is categorical) as our final predictor. The predictions from the random forest methodology typically have smaller variance.

5.1.6 Principal Component Analysis and Singular Value Decomposition

Principal Component Analysis (PCA) is an unsupervised learning tool. Let X be the n by p matrix of data, centered so that the average of each column is zero. It is possible to write X as a product of three matrices

X = UΣW⊤,

where U is an n by n matrix of orthonormal columns, Σ is an n by p diagonal matrix (that is, Σi,j = 0 if i ≠ j), and W is a p by p matrix of orthonormal columns. The diagonal elements of Σ are nonnegative real numbers in decreasing order, and they are called the singular values of X. The product representation is called the Singular Value Decomposition (SVD) of X. The ith column of UΣ is called the ith principal component of X. One can use the first k columns of UΣ as the reduced data matrix T with n observations, but only k features. This is called dimensionality reduction. Typically k = 2 or 3 is sufficient to capture the essence of the data.
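A sketch of the SVD-based dimensionality reduction just described, using simulated centered data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 100, 5, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                  # center each column

# Compact SVD: X = U diag(s) Wt, with singular values s in decreasing order.
U, s, Wt = np.linalg.svd(X, full_matrices=False)

# Reduced data matrix: first k columns of U*Sigma (n observations, k features).
T = U[:, :k] * s[:k]
```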

5.1.7 K-means Clustering

Let X be the n by p data matrix as in the previous subsection. Suppose we want to partition its n data points into K subsets so that their within-the-subset variation is small, and the between-the-subsets variation is large. Here K is a pre-specified number. Suppose the partition is S1, S2, · · · , SK. The centroid of the kth subset is defined as

x̄k = (1/|Sk|) ∑_{i∈Sk} xi.

Define the data mean as

x̄ = (1/n) ∑_{i=1}^{n} xi.


The within-the-subset variation for subset Sk is defined as

sk = ∑_{i∈Sk} ||xi − x̄k||,

where ||x|| denotes the sum of squares of the components of x. The total within-the-subsets variation is given by

s = ∑_{k=1}^{K} sk.

The aim is to find a partition that minimizes the above. We describe one algorithm to do so below:

Step 0: Given the data matrix X and the number of clusters K.
Step 1: Create an initial partition S1, · · · , SK by assigning each data point randomly to one of the K clusters. Compute the cluster centroids x̄k, 1 ≤ k ≤ K.
Step 2: Create an improved partition by assigning each point to the cluster whose centroid is closest to it (in the Euclidean distance sense). Recompute the centroids.
Step 3: If the new partition is different from the old one, go back to Step 2. Otherwise, stop.

This is a simple greedy algorithm that can find a local minimum. One typically runs the algorithm with multiple initial partitions and picks the best local optimum found among them.
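Steps 0–3 can be sketched in plain Python (illustrative; library implementations such as sklearn.cluster.KMeans handle initialization and empty clusters more carefully):

```python
import random

def centroid(pts):
    p = len(pts[0])
    return tuple(sum(x[j] for x in pts) / len(pts) for j in range(p))

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, K, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(K) for _ in points]        # Step 1: random partition
    while True:
        cents = []
        for k in range(K):
            members = [x for x, l in zip(points, labels) if l == k]
            # empty clusters get a random point as a stand-in centroid
            cents.append(centroid(members) if members else rng.choice(points))
        # Step 2: reassign each point to its nearest centroid
        new = [min(range(K), key=lambda k: dist2(x, cents[k])) for x in points]
        if new == labels:                              # Step 3: stop at a fixed point
            return labels, cents
        labels = new

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, cents = kmeans(points, 2)
```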

5.1.8 Hierarchical Clustering

The K-means algorithm needs a value of K. Hierarchical clustering avoids this and creates a tree structure to help identify the clusters. We define a dissimilarity measure between clusters, the Euclidean distance between their centroids being one such measure. The hierarchical algorithm is described below:

Step 0: Initially assign each data point to its own cluster, that is, set Si = {i}. Compute the dissimilarity measure d(Si, Sj) for all pairs (i, j).
Step 1: Find a pair (i, j) with the smallest d(Si, Sj), and merge the clusters to create a cluster Si ∪ Sj. This reduces the number of clusters by one.
Step 2: Repeat Step 1 until there is only one cluster left.
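A sketch of Steps 0–2 with the centroid dissimilarity (illustrative; scipy.cluster.hierarchy implements the standard linkage methods):

```python
def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def agglomerate(points):
    """Return the list of merges, from n singleton clusters down to one cluster."""
    clusters = [[p] for p in points]          # Step 0: one point per cluster
    merges = []
    while len(clusters) > 1:
        # Step 1: find the closest pair of clusters (centroid distance)
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: dist2(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges                              # Step 2: repeated until one cluster

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = agglomerate(points)
```

Recording the merge order (and the distances at which merges happen) is what produces the tree, or dendrogram, used to pick the clusters.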

5.1.9 Q-Learning

Suppose {(Xn, An), n ≥ 0} is an MDP with finite state space S, finite action space A, transition probabilities px,y(a), and random reward function R(x, a) with mean r(x, a), for x, y ∈ S and a ∈ A. Let γ be the discount factor. Let V∗ be the optimal value function; that is, V∗(x) is the maximum infinite-horizon expected discounted reward starting from state x. It is given by the unique solution to the Bellman optimality equations:

V∗(x) = max_{a∈A} Q∗(x, a), x ∈ S,

Q∗(x, a) = r(x, a) + γ ∑_{y∈S} px,y(a)V∗(y), x ∈ S, a ∈ A.


Let

A∗(x) = argmax_{a∈A} Q∗(x, a), x ∈ S.

Then the policy that chooses action A∗(x) in state x is optimal.

Q-learning is an iterative method of computing an approximation Qn of Q∗ so that Qn → Q∗ as n → ∞. It works as follows.

1. Let Q0 be a given initial approximation. (Could be zero.)

2. At time n observe Xn = x and choose an action An = a.

3. Observe Xn+1 = y, and R(Xn, An) = r.

4. Compute Qn(x, a) as follows:

Qn(x, a) = (1 − αn)Qn−1(x, a) + αn(r + γ max_{b∈A} Qn−1(y, b)).

(Here αn, n ≥ 1, is a given sequence of numbers in [0, 1], called the learning rates.) All other entries in Qn are the same as those in Qn−1.

5. Repeat until some termination criterion is satisfied.

Watkins and Dayan [109] proved the following convergence result. Let ni(x, a) be the time when action a is tried in state x for the ith time. Assume that ni(x, a) → ∞ as i → ∞; that is, each state-action pair is tried infinitely often. Suppose

∑_{i=1}^{∞} αni(x,a) = ∞, and ∑_{i=1}^{∞} α²ni(x,a) < ∞,

for all x ∈ S and a ∈ A. Then

Qn(x, a) → Q∗(x, a)

for all (x, a).

Thus the Q-learning method directly learns the optimal policy, without learning the rewards and transition probabilities. There are several variations of Q-learning methods in the literature.
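A tabular sketch of steps 1–5 on a hypothetical two-state MDP; the dynamics, rewards, and the 1/n learning-rate schedule (which satisfies the Watkins–Dayan conditions) are all illustrative choices:

```python
import random

S, A, gamma = [0, 1], [0, 1], 0.9

def step(x, a, rng):
    """Hypothetical dynamics: action 1 tends to move to state 1, which pays 1."""
    y = 1 if (a == 1 and rng.random() < 0.9) else 0
    r = 1.0 if y == 1 else 0.0
    return y, r

def q_learn(iters=20000, seed=0):
    rng = random.Random(seed)
    Q = {(x, a): 0.0 for x in S for a in A}
    counts = {(x, a): 0 for x in S for a in A}
    x = 0
    for _ in range(iters):
        a = rng.choice(A)                      # uniform exploration: every (x,a) tried often
        y, r = step(x, a, rng)
        counts[(x, a)] += 1
        alpha = 1.0 / counts[(x, a)]           # harmonic rates: sum = inf, sum of squares < inf
        target = r + gamma * max(Q[(y, b)] for b in A)
        Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * target
        x = y
    return Q

Q = q_learn()
greedy = {x: max(A, key=lambda a: Q[(x, a)]) for x in S}   # learned policy A*(x)
```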

5.2 Breast Cancer Screening

The US Preventive Services Task Force recommends that an average woman should get the first mammogram done at the age of 50 and then repeat it every two years until the age of 74. If the mammogram is positive (that is, it indicates the presence of a lump), a biopsy is done to confirm cancer. How does one evaluate the benefits of such a policy? How does one come up with an “optimal” starting point (50 years) and screening interval (2 years)? This requires a reasonable mathematical model and accurate data. Here we present a simple Markovian model inspired by Lee and Zelen [66].

Let X(t) be the state of breast cancer in a woman when she is t years old. {X(t), t ≥ 0} is a stochastic process with state space {0, 1, 2}. State 0 implies that the woman is disease-free, or the disease is undetectable. State 1 is the preclinical stage, that is, sizeable tumors exist that can be detected by a mammogram with high probability, but are otherwise asymptomatic. State 2 is the clinical stage, that is, tumors have been detected and the disease is symptomatic and readily identifiable. The state trajectory is always 0 → 1 → 2.

We assume that X(0) = 0.

We assume that {X(t), t ≥ 0} is a CTMC with the following generator matrix:

Q = [ −λ0   λ0    0
        0  −λ1   λ1
        0    0    0 ].

Thus the disease-free state lasts for an exp(λ0) amount of time, the preclinical stage for an exp(λ1) amount of time, and the clinical stage is absorbing. The initial distribution at time 0 is p(0) = [1 0 0]. In the absence of any screening, the state distribution of the patient at time t is given by

p(t) = p(0) exp(Qt).

It is easy to see that

exp(Qt) = [ e−λ0t   (λ0/(λ0−λ1))(e−λ1t − e−λ0t)   1 − (λ0e−λ1t − λ1e−λ0t)/(λ0−λ1)
            0       e−λ1t                          1 − e−λ1t
            0       0                              1 ].

Define pi(t) = P(X(t) = i|X(0) = 0) (i = 0, 1, 2), assuming no screening is done. Then, for t ≥ 0, we have

p0(t) = e−λ0t,

p1(t) = (λ0/(λ0 − λ1))(e−λ1t − e−λ0t),

p2(t) = 1 − (λ0e−λ1t − λ1e−λ0t)/(λ0 − λ1).

Thus the incidence rate of breast cancer is given by λ1p1(t). One can use observed incidence rates to estimate λ0 and λ1. Table 5.2 shows the incidence rate of breast cancer per 100,000 as a function of age. Thus the incidence rate I(t) over t ∈ [50, 55) is 232.7/100,000. One can choose the parameters λ0 and λ1 to minimize the squared error

∫_{10}^{100} (λ1p1(t) − I(t))² dt.
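The closed-form pi(t) above can be evaluated directly; a sketch using illustrative rates with λ0 ≠ λ1, since the formulas are written for distinct rates (the fitted λ0 = λ1 case below is their limit):

```python
import math

def p(t, lam0, lam1):
    """Transient probabilities (p0, p1, p2) of the 0 -> 1 -> 2 chain at time t."""
    p0 = math.exp(-lam0 * t)
    p1 = lam0 / (lam0 - lam1) * (math.exp(-lam1 * t) - math.exp(-lam0 * t))
    p2 = 1.0 - (lam0 * math.exp(-lam1 * t) - lam1 * math.exp(-lam0 * t)) / (lam0 - lam1)
    return p0, p1, p2

def incidence(t, lam0, lam1):
    """Rate lambda1 * p1(t) of entering the clinical stage at age t."""
    return lam1 * p(t, lam0, lam1)[1]

probs = p(50.0, 0.010, 0.008)   # illustrative rates, not the fitted values
```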


Age Range    I(t)
10-14          .1
15-19          .1
20-24           1
25-29         7.5
30-34        25.2
35-39        63.8
40-44       125.4
45-49       197.8
50-54       232.7
55-59       278.0
60-64       343.3
65-69       412.1
70-74       451.0
75-79       483.9
80-84       477.4
85+         432.5

Table 5.2: Incidence rates for Breast Cancer

Using numerical optimization we see that the integral is minimized at

λ0 = λ1 = .0092.

Figure 5.3 shows the observed incidence rate I(t) (the piecewise constant function) and the estimated incidence rate λ1p1(t) (the smooth curve), as a function of age t. As seen from the figure, this is not a very good fit. This just means that the assumption of exponential sojourn times in states 0 and 1 is not very good. However, we shall use these values of λ0 and λ1 in the subsequent calculations.

Figure 5.4 plots the three probabilities pi(t) (i = 0, 1, 2) as a function of t with λ0 = λ1 = .0092, assuming no screening is done. The top curve is p0, the middle curve is p1, and the bottom curve is p2.

Now consider a screening policy π given by an increasing sequence {tn, 1 ≤ n ≤ N}, where tn is the time when the nth mammogram is administered. Note that a mammogram is administered only if the patient is not already in state 2. Let β be the sensitivity (true positive rate) of mammography, that is, β is the probability that the mammogram says the patient has breast cancer given that the patient has breast cancer (the patient is in state 1). Let α be the specificity (true negative rate) of mammography, that is, α is the probability that the mammogram says the patient has no cancer, given that the patient has no cancer (the patient is in state 0). If the mammogram turns out to be negative (that is, no lesion of any kind is indicated), we get no more information, and the posterior probability of the state of the patient is the same as the prior probability. If the mammogram is positive (that is, it indicates the presence of tumors), a more definitive test (such as a biopsy) is


Figure 5.3: Observed incidence rate I(t) and estimated incidence rate λ1p1(t), as a function of t.


Figure 5.4: pi(t) as a function of t under no screening.


done. That test will reveal the presence of tumors accurately. If it finds no tumors, we declare the true state to be 0 (disease-free); otherwise we declare the state to be 2 (not 1, because now we know that tumors exist).

Let pπi(t) be the probability that the patient is in state i at time t, assuming screening policy π is in effect. Then pπi(t) = pi(t) for 0 ≤ t < t1. The probability that the patient is in state i just before the nth mammogram is given by pπi(tn−). Thus the posterior probabilities (after the mammogram) are

pπ0(tn+) = pπ0(tn−), pπ1(tn+) = (1 − β)pπ1(tn−), pπ2(tn+) = 1 − pπ0(tn+) − pπ1(tn+).

Thus the mammogram causes an instantaneous jump from state 1 to state 2 if it is positive. If the state is 0 before the mammogram, it continues to be 0 after the mammogram.

The stochastic process {X(t), tn ≤ t < tn+1} is a CTMC with the same Q matrix as before, but starting with initial distribution pπ(tn+) at tn. Thus we have

pπ(t) = pπ(tn+)eQ(t−tn), tn ≤ t < tn+1.

Thus the probability distribution just before the (n + 1)st mammogram is given by

pπ(tn+1−) = pπ(tn+)eQ(tn+1−tn).

A biopsy is unnecessary (false positive) if the state is 0 and the mammogram is positive. A mammogram is beneficial (true positive) if the state is 1 and the mammogram is positive.

Using these recursions we can compute the expected total number of biopsies under policy π as

T = ∑_{n=1}^{N} (1 − β)^n ∏_{k=1}^{n} pπ1(tk−).

Of these, the expected number of unnecessary biopsies (false positives) is

B = (1 − α) ∑_{n=1}^{N} pπ0(tn−),

and the expected number of breast cancer cases that are discovered early (true positives) is given by

D = β ∑_{n=1}^{N} pπ1(tn−).

Note that D is also the probability that the screening identifies the cancer, instead of a spontaneous transition from state 1 to 2.

Next we consider the policy of doing the first screening at age 50 and then doing it every 5 years until the age of 80 or until the cancer is detected. In Figure 5.5 we plot the state distributions pπ1 and pπ2 as functions of age. After each screening, pπ1(t) jumps down, and pπ2(t) jumps up. Comparing this with Figure 5.4 we see that, after the screening starts, the probability of being in state 1 decreases drastically, and the probability of state 2 increases. The expected number of unnecessary biopsies can be computed to be .3866, while the expected number of useful biopsies is .4291. The expected total number of biopsies done is 1.6342.
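The screening recursion can be sketched by alternating the exp(Qt) evolution with the posterior jump at each mammogram. The rates, sensitivity, and specificity below are illustrative (with λ0 ≠ λ1 to keep the closed form nondegenerate), not the fitted values used in the text:

```python
import math

def expQ(t, lam0, lam1):
    """Rows of exp(Qt) for the 0 -> 1 -> 2 chain, from the closed form above."""
    e0, e1 = math.exp(-lam0 * t), math.exp(-lam1 * t)
    a = lam0 / (lam0 - lam1) * (e1 - e0)
    return [
        [e0, a, 1.0 - e0 - a],
        [0.0, e1, 1.0 - e1],
        [0.0, 0.0, 1.0],
    ]

def evolve(p, t, lam0, lam1):
    M = expQ(t, lam0, lam1)
    return [sum(p[i] * M[i][j] for i in range(3)) for j in range(3)]

def screen(times, lam0=0.010, lam1=0.008, alpha=0.9, beta=0.85):
    """Accumulate B (false positives) and D (true positives) over screening ages."""
    p = [1.0, 0.0, 0.0]
    prev = 0.0
    B = D = 0.0
    for t in times:
        p = evolve(p, t - prev, lam0, lam1)       # distribution just before t_n
        B += (1 - alpha) * p[0]                    # unnecessary biopsies
        D += beta * p[1]                           # cancers caught early
        p = [p[0], (1 - beta) * p[1], 0.0]         # posterior jump: 1 -> 2 if positive
        p[2] = 1.0 - p[0] - p[1]
        prev = t
    return B, D

B, D = screen([50, 55, 60, 65, 70, 75, 80])        # (50, 5, 80) policy
```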


Figure 5.5: pπi (t) as a function of t under a screening policy.

There are two main questions: First, how do we determine the disease progression model? Second, how do we evaluate a screening policy? Lee and Zelen [66] provide several approaches to answer these questions. They model the disease progression as a three-state semi-Markov process, and then use age-dependent incidence rate data to estimate the distribution of sojourn times in states 0 and 1. They propose several methods to evaluate a policy. Here we propose another one. Suppose km is the cost of a mammogram and kb is the cost of a biopsy. Then the total cost of the policy π for a single patient is given by

kmT + kb(B + D).

The probability that the disease is detected by the screening program is D. So the cost per detected case is

(kmT + kb(B + D))/D.

For example, suppose km = 300 and kb = 2000 dollars. Then the cost per detected case for the (50, 5, 80) policy π is 4944 dollars. The corresponding figure for the recommended (50, 2, 74) policy is 3433 dollars per detected case. Thus, if our disease model is correct, (50, 2, 74) is a better policy than (50, 5, 80).

Lee and Zelen [66] propose a threshold screening policy that works as follows: do the first screening as soon as p1(t) reaches a threshold c. After the mammogram, p1(t) will jump down. Do the next screening every time p1(t) reaches the threshold c again. This is a dynamic policy that performs better than the rigid periodic screening policies. In our numerical example, suppose we choose c = .2016, so that the first mammogram will be done at the age of 50. In this case p1(t) jumps down at 50 to .0403, but never climbs back to .2016, so we never do another mammogram. So in this case the dynamic policy reduces to a single-mammogram policy and produces a cost of 4161 dollars per detected case.


The problem of finding an optimal screening policy is difficult to formulate since the objective function (cost per detected case) is not additive.

Assignment: Present the relevant results from Lee and Zelen [66].

Next we describe the model studied by Maillart et al [78]. They develop a discrete time Markov model of disease progression whereby the patient state is 0 (no disease), 1 (small tumors, less than 1.9 cm), 2 (big tumors, 2.0 cm or more), 3 (death from breast cancer) or 4 (death from other causes). The patient state is tracked every six months. Let Xn be the state in the nth period. Let an be the patient age in period n. If we track the patient from age 25 to 100, the period n takes values from 0 to 150, and the age in period n is an = 25 + n/2 years. Let πi(n) = P(Xn = i) be the state distribution in period n. If Xn ∈ {0, 1, 2}, we do not know the actual value of Xn, while Xn = 3 and 4 are observable states. Hence we think of this as a partially observable DTMC. In that case {π(n) = [π0(n), π1(n), π2(n)], n ≥ 0} is a DTMC. They consider a large class of policies described by five parameters [starting age, first interval, switch age, second interval, stop age]. Thus a policy [25, 3, 55, 4, 75] implies the first screening at age 25, then screening every 3 years until the age of 55, then screening every four years until the age of 75. They compute two performance measures for a policy:

1. A = the probability of eventual death from breast cancer,
2. B = the expected number of mammograms conducted.

These are unambiguous and not subjective, since they do not use quality-of-life measures. They evaluate over 1200 policies, plot them in the A-B plane, and find the efficient frontier. A recommended policy should belong to the efficient frontier. They mention several data sources from which the relevant data is obtained. The main drawback of their model is the assumption that breast cancer is revealed only after a screening, and never symptomatically.

5.3 Epidemic Management

The material in this section is based on Ludkovski and Niemi [77].

5.3.1 SIR Model

We begin with a brief overview of a common model of an infectious disease called the SIR model (see Anderson and Britton [1]). Here a member of a community starts as a susceptible (S), then gets infected (I), and is finally removed (R) or recovers. Assume that the total number of individuals in the community is fixed, say N. Let S(t) and I(t) be the number of susceptible and infected individuals at time t. The number of removed members is then R(t) = N − S(t) − I(t). The SIR model assumes that, in the absence of any control, {(S(t), I(t)), t ≥ 0} is a CTMC with two parameters β and γ, with rates given by

q((s, i) → (s − 1, i + 1)) = βsi/N,

q((s, i) → (s, i − 1)) = γi.

Typically it is assumed that the epidemic starts at time 0 when a single member brings the infection to the community. Thus I(0) = 1 and S(0) = N − 1. It can be shown that E(I(t)) is a unimodal function of time: it increases initially, and then reduces to zero. The epidemic dies out at time T given by

T = min{t ≥ 0 : I(t) = 0}.

The size of the epidemic is the total number of members who are affected by the epidemic, namely R(T). This can be computed approximately from the Kermack-McKendrick equation

log((N − E(R(T)))/S(0)) + R0E(R(T))/N = 0,

where R0 = β/γ is the reproductive number. If R0 > 1, the epidemic affects a large fraction of the community, while R0 < 1 implies that the epidemic dies out quickly.

In practice the parameters β and γ are unknown, and need to be estimated from the observed trajectory of {(S(t), I(t)), t ≥ 0}. One way to do this is to use Bayesian methodology. Assume that at time 0 the prior distribution of β is Gamma(aβ, bβ), that of γ is Gamma(aγ, bγ), and the two are independent. Now suppose we have observed the epidemic up to time t, and let r1(t) and r2(t) be the number of S → I and I → R transitions over [0, t]. Then the posterior distributions of β and γ at time t are independent Gamma distributions with parameters

(Aβ(t), Bβ(t)) = (aβ + r1(t), bβ + ∫_0^t I(u)S(u) du / N),

and

(Aγ(t), Bγ(t)) = (aγ + r2(t), bγ + ∫_0^t I(u) du).

Using these, one can compute the means and credible intervals for β and γ.
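The conjugate updates are easy to compute from an observed piecewise-constant trajectory. The sketch below assumes the path is recorded as (time, s, i) triples at the jump epochs, ending at the current time; the representation is ours, not the authors':

```python
def sir_posteriors(prior, path, n):
    """Posterior Gamma parameters for (beta, gamma) from an observed SIR path.
    `prior` = (a_beta, b_beta, a_gamma, b_gamma); `path` is a list of
    (time, s, i) records at the jump epochs, ending at the current time."""
    a_b, b_b, a_g, b_g = prior
    r1 = r2 = 0
    int_si = int_i = 0.0
    for (t0, s0, i0), (t1, s1, i1) in zip(path, path[1:]):
        dt = t1 - t0
        int_si += i0 * s0 * dt / n         # adds to B_beta
        int_i += i0 * dt                   # adds to B_gamma
        if s1 == s0 - 1 and i1 == i0 + 1:
            r1 += 1                        # an S -> I transition observed
        elif i1 == i0 - 1:
            r2 += 1                        # an I -> R transition observed
    return (a_b + r1, b_b + int_si), (a_g + r2, b_g + int_i)
```

The posterior means are then Aβ(t)/Bβ(t) and Aγ(t)/Bγ(t).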

5.3.2 Control of Epidemics

Our main aim in modeling the epidemic is to control it. We present the control model studied by Ludkovski and Niemi [77]. They consider two basic methods of controlling an epidemic: quarantine (isolation) and vaccination. Quarantine changes the infection rate from β to κβ, where 0 < κ < 1 is a fixed constant. Vaccination removes susceptibles directly, without their becoming infected first. Vaccination introduces a transition from (s, i) to (s − 1, i) with rate

q((s, i)→ (s− 1, i)) = δs,

where δ is a fixed known constant.

The aim is to decide when to initiate quarantine and when to initiate vaccination so as to minimize the total expected cost. Once a program is initiated, it remains in effect until the epidemic is declared over. The cost consists


of the cost of the sick members (the infected), and the cost of implementing quarantine and vaccination. Ludkovski and Niemi [77] consider a rather detailed cost function. Here we use a simplified one to illustrate the method. Let the cost of i infected members be c(i) per unit time, let the cost rate of quarantine when there are i infected members be cQ(i), and let the cost rate of vaccination when there are s susceptible members be cV(s). We formulate this as a finite-horizon discrete-time Markov decision process.

The state of the system at time k (before any decision is made) is given by (X(k), Y(k)), where X(k) = (S(k), I(k), Aβ(k), Bβ(k), Aγ(k), Bγ(k)) and Y(k) = (Q(k), V(k)). Here the joint distribution of (β, γ) at time k is Gamma(Aβ(k), Bβ(k)) × Gamma(Aγ(k), Bγ(k)), Q(k) = 1 if quarantine is in effect at time k and zero otherwise, and V(k) = 1 if vaccination is in effect at time k and zero otherwise. The decision chosen at time k is D(k) = (q, v) ∈ {0, 1}². Clearly, if (Q(k), V(k)) = (1, 1) the only decision allowed is D(k) = (1, 1). If (Q(k), V(k)) = (1, 0), D(k) can be (1, 0) or (1, 1). If (Q(k), V(k)) = (0, 1), D(k) can be (0, 1) or (1, 1), and if (Q(k), V(k)) = (0, 0), D(k) can be (0, 0), (0, 1), (1, 0) or (1, 1). It is easy to see that {(X(k), Y(k)), D(k), k = 0, 1, · · · , K} is an MDP.
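The constraint that a started program stays on can be encoded in a small helper; this is just a restatement of the four cases above (the function name is ours):

```python
from itertools import product

def feasible_decisions(q, v):
    """Decisions D(k) allowed in regime state Y(k) = (q, v): a program that
    is already running (component equal to 1) cannot be switched off."""
    return [(dq, dv) for dq, dv in product((0, 1), repeat=2)
            if dq >= q and dv >= v]
```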

Suppose the decision chosen at time k after observing state X(k) is D(k). Over the time period (k, k + 1), the CTMC {(S(t), I(t)), k ≤ t < k + 1} has parameters (β, γ, δ) if D(k) = (0, 1), (κβ, γ, δ) if D(k) = (1, 1), (β, γ, 0) if D(k) = (0, 0), and (κβ, γ, 0) if D(k) = (1, 0). We have

Aβ(k + 1) = Aβ(k) + r1(k + 1) − r1(k),

Bβ(k + 1) = Bβ(k) + ∫_k^{k+1} I(u)S(u) du / N,

Aγ(k + 1) = Aγ(k) + r2(k + 1) − r2(k),

Bγ(k + 1) = Bγ(k) + ∫_k^{k+1} I(u) du.

The cost incurred over (k, k + 1), if decision (q, v) is chosen and X(k) = x, is given by

c(x, (q, v)) = E( ∫_k^{k+1} [c(I(u)) + q cQ(I(u)) + v cV(S(u))] du | X(k) = x, D(k) = (q, v) ).

Let vk(x, y) be the minimum expected cost over periods k to K if the state at time k is (x, y). We have

vK(x, y) = 0.


For k < K we have

vk(x, (1, 1)) = c(x, (1, 1)) + E(vk+1(X(k + 1), (1, 1)) | X(k) = x, D(k) = (1, 1)),

vk(x, (0, 1)) = min{c(x, (0, 1)) + E(vk+1(X(k + 1), (0, 1)) | X(k) = x, D(k) = (0, 1)),
c(x, (1, 1)) + E(vk+1(X(k + 1), (1, 1)) | X(k) = x, D(k) = (1, 1))},

vk(x, (1, 0)) = min{c(x, (1, 0)) + E(vk+1(X(k + 1), (1, 0)) | X(k) = x, D(k) = (1, 0)),
c(x, (1, 1)) + E(vk+1(X(k + 1), (1, 1)) | X(k) = x, D(k) = (1, 1))},

vk(x, (0, 0)) = min{c(x, (0, 0)) + E(vk+1(X(k + 1), (0, 0)) | X(k) = x, D(k) = (0, 0)),
c(x, (0, 1)) + E(vk+1(X(k + 1), (0, 1)) | X(k) = x, D(k) = (0, 1)),
c(x, (1, 0)) + E(vk+1(X(k + 1), (1, 0)) | X(k) = x, D(k) = (1, 0)),
c(x, (1, 1)) + E(vk+1(X(k + 1), (1, 1)) | X(k) = x, D(k) = (1, 1))}.

In theory, one can compute the vk's by backward recursion to obtain the optimal cost as v0((N − 1, 1, aβ, bβ, aγ, bγ), (0, 0)). However, this is numerically intractable.

There are several approximate numerical algorithms for solving the dynamic programming recursions to obtain the optimal policy; see the book by Powell [88] for many such methods. Q-learning (see Watkins and Dayan [109]) combines value iteration and learning for infinite-horizon discounted-cost MDPs. However, it requires computing the value function for every state, which can be prohibitive when the state space is large or uncountable, as in our case. Ludkovski and Niemi [77] suggest a method called Regression Monte Carlo (RMC) that approximates the value function using simulation and regression on a few selected epidemic covariates (in addition to the eight state variables).

Assignment: Present the RMC algorithm from Ludkovski and Niemi [77].

It is possible to consider more complex policies. For example, one may think of κ and δ as decision variables, their values representing the severity level of the quarantine and the coverage level of the vaccination program. The cost will then depend on these levels, and the aim is to choose them optimally. The authors do not consider such policies.

5.4 Precision Medicine

The goal of precision medicine is to improve outcomes by tailoring treatments to individual patients, or to groups of patients. Kosorok and Laber [59] provide a very nice review paper on this topic. Clearly this brings together statistics, stochastic models, and optimization. We shall illustrate the methodology with three special cases below.


5.4.1 Patient-specific Treatment Rules

The material presented here is based on Zhao et al [110]. Suppose we have the following data about n patients: (Xi, Ai, Ri), 1 ≤ i ≤ n, where Xi is a numerical vector representing the pre-treatment state of the ith patient, Ai indicates which of the two treatments is given to the patient (denoted as 1 or −1 for convenience), and Ri is a non-negative numerical outcome of the treatment for the ith patient, coded so that higher values indicate a better outcome. We assume that (Xi, Ai, Ri) ∈ X × A × [0, ∞), and that the data form an iid sequence governed by a probability measure P with π = P(A = 1) > 0 and P(A = −1) = 1 − π > 0.

A patient-specific treatment rule (PTR) D is a mapping from the set of patient states X to the set of treatments A = {1, −1}. Thus if a patient's state is x, she is given treatment D(x). One can also describe a PTR by a function f : X → (−∞, ∞) such that D(x) = sign(f(x)). The efficacy of D is measured by the expected clinical outcome when the PTR D is applied to a randomly chosen patient, namely

φD = E(R | X, D(X)) = E( 1{A = D(X)} R / (Aπ + (1 − A)/2) ). (5.1)

(Derive this.) This is called the value function of D. We say that D∗ is optimal if

φD∗ ≥ φD,

for all patient-specific treatment rules D. Clearly, the optimal rule is given by

D∗(x) = sign(E(R | X = x, A = 1) − E(R | X = x, A = −1)).

This requires us to build an estimator of E(R | X = x, A = a) from the available data. Zhao et al [110] suggest an alternative approach based on the last expression in Equation 5.1. They formulate the problem of finding the optimal D = sign(f) as that of finding the f that minimizes

R(f) = (1/n) Σ_{i=1}^n (Ri/πi) 1{Ai ≠ sign(f(Xi))},

where
πi = Ai π + (1 − Ai)/2.

One can interpret R(f) as the weighted loss of the classification scheme f. It is not a convex function of f, and hence difficult to minimize. The authors therefore use a Support Vector Machine (SVM) surrogate and minimize the following function instead:

L(f) = (1/n) Σ_{i=1}^n (Ri/πi) φ(Ai f(Xi)) + λn ||f||,

where φ(t) = max(0, 1 − t) and ||f|| is some norm of f. The authors call this the OWL (outcome weighted learning) approach.
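To make the OWL surrogate concrete, here is a minimal numpy sketch that minimizes L over affine rules f(x) = β⊤x + β0 by subgradient descent. This is not the authors' algorithm (they solve a dual quadratic program, described next), and the squared-norm penalty λ||β||² is our simplifying assumption:

```python
import numpy as np

def owl_linear(X, A, R, pi, lam=0.1, lr=0.05, iters=500):
    """Subgradient descent on the OWL surrogate for an affine rule
    f(x) = beta.x + beta0:
        (1/n) sum_i (R_i/pi_i) max(0, 1 - A_i f(X_i)) + lam * ||beta||^2.
    A sketch only; the paper minimizes the surrogate via a dual QP."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    w = R / pi                                   # outcome weights R_i / pi_i
    for _ in range(iters):
        margin = A * (X @ beta + beta0)
        active = (margin < 1).astype(float)      # points inside the hinge
        beta -= lr * (-(w * active * A) @ X / n + 2 * lam * beta)
        beta0 -= lr * (-np.mean(w * active * A))
    return beta, beta0
```

On toy data where treatment 1 helps exactly when x > 0, the fitted rule sign(β⊤x + β0) recovers that boundary.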

If we restrict the decision function to the class of affine functions

f(x) = β⊤x + β0,


the minimization problem involves first solving the following quadratic program, involving a positive tuning parameter κ:

Maximize Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj Ai Aj ⟨Xi, Xj⟩

Sub. to: αi ≤ κRi/πi, 1 ≤ i ≤ n,
Σ_{i=1}^n αi Ai = 0,
αi ≥ 0, 1 ≤ i ≤ n.

Here ⟨x, y⟩ is the inner product of x and y. Using the solution α to the above quadratic program, we get

β = Σ_{i=1}^n αi Ai Xi.

The intercept β0 needs a bit more work, and is not given here. The optimal linear treatment rule is then given by

D(x) = sign(β⊤x + β0).

Interestingly, this methodology can be extended to fit a non-linear decision function using the theory of reproducing kernel Hilbert spaces. In particular, the authors use the following inner product (the Gaussian kernel):

⟨x, y⟩ = exp(−σ||x − y||²).

We now solve the same quadratic program, but with the inner product defined above. The optimal decision function is then given by

f(x) = Σ_{i=1}^n αi Ai ⟨x, Xi⟩ + β0,

and the patient-specific treatment rule is given by

D(x) = sign(f(x)).
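Given a dual solution α and intercept β0 (taken as inputs here, not computed), evaluating the nonlinear rule is a direct transcription of the last two displays:

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel <x, y> = exp(-sigma * ||x - y||^2) from the text."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_rule(x, X, A, alpha, beta0, sigma=1.0):
    """Treatment D(x) = sign(f(x)) with
    f(x) = sum_i alpha_i * A_i * <x, X_i> + beta0."""
    f = sum(al * a * rbf_kernel(x, xi, sigma)
            for al, a, xi in zip(alpha, A, X)) + beta0
    return 1 if f >= 0 else -1
```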

The authors establish several theoretical properties of these decision rules, such as consistency and regret bounds.

Finally, the authors present the results of using this methodology to determine the optimal patient-specific treatment rule for depression.

Assignment: Present the computational algorithms to compute the optimal linear and nonlinear f from Zhao et al [110].

5.4.2 Patient-Specific Dosage Rules

Chen et al [29] consider a setting similar to that of Section 5.4.1 to determine a patient-specific dosage rule. As described there, we begin with data (Xi, Ai, Ri), 1 ≤ i ≤ n. However, now Ai is the dosage level used on


patient i, assumed to lie in [0, 1]. A patient-specific dosage rule is a function f : X → [0, 1]. Its value function is given by

ν(f) = E(R|X, f(X)).

We are interested in discovering the optimal rule f∗ such that

ν(f∗) ≥ ν(f)

for all rules f . Letg(x, a) = E(R|X = x,A = a).

One can estimate the function g using machine learning algorithms on the given data. Then it is clear that

f∗(x) = argmax_a g(x, a).
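One naive way to operationalize this two-step route (estimate g, then maximize over a dose grid) is sketched below with a k-nearest-neighbor estimate of g. This is only for illustration, since, as noted next, Chen et al [29] deliberately avoid estimating g:

```python
import numpy as np

def knn_g(x, a, X, A, R, k=5):
    """Naive k-nearest-neighbor estimate of g(x, a) = E[R | X = x, A = a],
    treating the pair (x, a) as a single feature vector."""
    z = np.column_stack([X, A])
    d = np.sum((z - np.append(x, a)) ** 2, axis=1)
    return R[np.argsort(d)[:k]].mean()

def best_dose(x, X, A, R, grid=None, k=5):
    """Grid-search approximation of f*(x) = argmax_a g(x, a) over [0, 1]."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    return max(grid, key=lambda a: knn_g(x, a, X, A, R, k))
```

On simulated data whose reward peaks at dose 0.5 regardless of x, the grid search recovers a dose near 0.5.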

Chen et al [29] propose an algorithm to construct f∗ directly from the data, without estimating the g function. They formulate the problem of finding f∗ as that of minimizing

ν(f) = (1/n) Σ_{i=1}^n (Ri/πi) 1{Ai ≠ f(Xi)},

where
πi = P(Ai | Xi).

There are problems with this formulation, since Ai and f(x) are continuous quantities. The authors get around this difficulty by introducing a parameter φ > 0 and using the following surrogate for ν:

R(f) = (1/n) Σ_{i=1}^n (Ri / (φ p(Ai | Xi))) min(|Ai − f(Xi)|/φ, 1),

where p(a | x) is the conditional density of A given X = x. This can be estimated empirically if the data are observational, or is known if the data come from a randomized trial of the dosage. The authors discuss computational algorithms to compute the optimal linear and nonlinear f, just as in the previous section.
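The surrogate is straightforward to evaluate on data; a sketch, with illustrative argument names, taking f(Xi) and p(Ai | Xi) as precomputed arrays:

```python
def dosage_surrogate(f_vals, A, R, p, phi=0.1):
    """Empirical surrogate
        R(f) = (1/n) sum_i R_i / (phi * p(A_i|X_i)) * min(|A_i - f(X_i)|/phi, 1),
    where f_vals[i] = f(X_i) and p[i] = p(A_i | X_i)."""
    n = len(A)
    return sum(r / (phi * pi) * min(abs(a - fv) / phi, 1.0)
               for fv, a, r, pi in zip(f_vals, A, R, p)) / n
```

The loss is zero when the rule reproduces every observed dose exactly, and each mismatch beyond φ contributes its full weight Ri/(φ p(Ai | Xi)).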

Finally, they describe an application of their methodology to the problem of deciding the optimal dosage of Warfarin for preventing blood clots.

Assignment: Present the algorithms for computing the optimal f from Chen et al [29].

5.4.3 Dynamic Control for Diabetes

The material in this section is based on the results in Luckett et al [76]. Consider a patient with type 1 diabetes. The pancreas of such a patient is unable to produce insulin in sufficient quantities to process the sugar in the blood. The treatment involves controlling diet, engaging in regular exercise, and injecting insulin as needed. Typically the patient monitors the blood sugar level by pricking herself, and decides whether or not to inject insulin. Current technology makes it possible to wear a device that will record the food


intake, the exercise done, and the glucose level in the blood, continuously in time. The patient can also wear an insulin pump that can automatically inject insulin as needed. What we need is a decision rule that tells the insulin pump whether or not to inject insulin, based on the data about the patient collected by the tracking device. Such a dynamic routine is called a mobile health application.

We shall discretize the problem so that decisions are made at times t = 1, 2, 3, . . . (say every hour). Let Xt be the state of the patient at time t (say the current blood glucose level, weight, food intake over the last hour, exercise done in the last hour, etc.). Let S be the state space. We assume that it includes an absorbing state, labeled 0, such that if the patient reaches that state, she stays there forever (she is out of the treatment). Let At be the action taken at time t, say 1 if an injection is administered and 0 otherwise. The transition probabilities are assumed to satisfy

P(Xt+1 = y | Xt = x, At = a, history up to t − 1) = p(y | x, a). (5.2)

The reward (or utility) at time t is r(Xt, At, Xt+1). The aim is to find a policy that maximizes the total expected discounted reward, with discount factor 0 ≤ γ < 1. It is possible to formulate this as an MDP and find an optimal policy. However, the computations involved are typically too onerous to be implementable.

The authors propose to look for an optimal policy within a class of randomized policies described by a vector parameter β. Under such a policy, we choose action a in state x with probability π(a, x), where

π(a, x) = P(A = a | X = x) =
exp(β⊤x)/(1 + exp(β⊤x)) if a = 1,
1/(1 + exp(β⊤x)) if a = 0. (5.3)

If randomization is not practical, one can use a tuning parameter δ ∈ (0, 1) and choose action 1 if π(1, x) ≥ δ, and action 0 otherwise.
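Equation 5.3 and its thresholded variant can be written directly (a sketch; the function names are ours):

```python
import math

def pi_beta(a, x, beta):
    """Randomized policy of Equation 5.3: logistic probability of action a,
    using exp(b.x)/(1 + exp(b.x)) = 1/(1 + exp(-b.x))."""
    p1 = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
    return p1 if a == 1 else 1.0 - p1

def act(x, beta, delta=0.5):
    """Deterministic variant: inject (action 1) iff pi(1, x) >= delta."""
    return 1 if pi_beta(1, x, beta) >= delta else 0
```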

Let Vβ(x) be the total expected discounted reward from using this policy starting from state x. First-step analysis yields

Vβ(x) = E(r(X0, A,X1) + γVβ(X1)|X0 = x). (5.4)

Here the expectation is carried out using the distribution of (A, X1) given X0 = x, as determined by Equations 5.2 and 5.3. Suppose the distribution of X0 is given by φ. Then we choose the optimal β as

β∗ = argmax_β E(Vβ(X0)).

The difficulty in using this approach is that the transition probabilities, and hence the function Vβ, are unknown. This can be alleviated by using the Q-learning algorithm, or the V-learning algorithm proposed by Luckett et al [76]. The first step in the V-learning algorithm is to represent the value function Vβ as

Vβ(x) = Φ(x)⊤θ,

where Φ(x) = [φ1(x), · · · , φq(x)]⊤ is a column vector of q basis functions φ1, · · · , φq evaluated at x, and θ = [θ1, · · · , θq]⊤ is a vector of real numbers. Think of this representation as a regression.

We next describe a method of computing the “right” θ. Note that Equation 5.4 can be written as

E(r(X0, A,X1) + γVβ(X1)− Vβ(X0)|X0) = 0,


which implies that

E((r(X0, A, X1) + γVβ(X1) − Vβ(X0)) ψ(X0)) = 0,

for any function ψ. This suggests a method of updating β and θ as data accumulate. Suppose at time t we have observed {X0, A0, · · · , Xt, At, Xt+1}. We use the policy π of Equation 5.3, with the current estimate of β plugged in at each time; let πk−1 denote the policy actually used at time k. First compute

Λt = Σ_{k=1}^t (π(Ak, Xk)/πk−1(Ak, Xk)) (γΦ(Xk)Φ(Xk+1)⊤ − Φ(Xk)Φ(Xk)⊤) θ + Σ_{k=1}^t (π(Ak, Xk)/πk−1(Ak, Xk)) r(Xk, Ak, Xk+1) Φ(Xk).

Note that Λt is a column vector of length q. Let Ω be a q × q positive definite matrix, and compute

θt = argmin_θ { Λt⊤ Ω Λt + λ||θ|| }.

Here λ is a tuning parameter to keep θ parsimonious. This is a quadratic program and can be solved efficiently for a given β. Note that θt is a function of the β to be used at time t. Now compute

Vt(β) = E(Φ(X0)⊤θt),

where the expectation is with respect to the initial distribution of X0. Then choose

βt+1 = argmax_β Vt(β).

Then at time t + 1 we make decisions using βt+1.
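Note that Λt is affine in θ, say Λt = Mθ + c, with M and c assembled from the two sums in the display above. If the penalty is taken to be λ||θ||² (our simplification), the θ-update is available in closed form:

```python
import numpy as np

def solve_theta(M, c, Omega, lam):
    """Minimize (M th + c)^T Omega (M th + c) + lam * ||th||^2 over th.
    Setting the gradient to zero gives (M^T Omega M + lam I) th = -M^T Omega c."""
    q = M.shape[1]
    return np.linalg.solve(M.T @ Omega @ M + lam * np.eye(q), -M.T @ Omega @ c)
```

For Ω = I and λ = 0 this reduces to an ordinary least-squares solve.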

The authors study extensions of this method to multiple users and to more than two actions. They establish consistency properties of the estimator. Finally, they apply it to diabetes patients.

Assignment: Present the simulation and observational results from Luckett et al [76].


Chapter 6

Other Topics

There are a large number of other topics that we have not touched upon. We mention a few below.

6.1 Recommendation Systems

Use online data to provide recommendations of items (such as movies, restaurants, cars, etc.) to users.

6.2 Rating Systems

Use user ratings of goods and services to predict and manipulate user behavior.

6.3 Online Advertising

Use a user's browsing history to populate the screen with relevant advertising. This is also used to print coupons on the back of grocery store receipts.

6.4 Predictive Maintenance

Use sensors in expensive machines to predict imminent failures and initiate maintenance.

6.5 Sports Analytics

Use data to predict performance of sports teams.


Bibliography

[1] Andersson, H. and T. Britton, (2000). Stochastic Epidemic Models and Their Statistical Analysis. Lecture Notes in Statistics, Vol. 151, Springer.

[2] Araman, V. F. and R. Caldentey, (2009). Dynamic pricing for nonperishable products with demand learning. Operations Research, 57, 1188.

[3] Aktekin, S. and R. Soyer, (2012). Bayesian Analysis of Queues with Impatient Customers: Applications to Call Centers. Naval Research Logistics, 59, 441-456.

[4] Lim, A. E. B., J. G. Shanthikumar, and Z.-J. M. Shen, (2014). Model Uncertainty, Robust Optimization, and Learning. INFORMS Tutorials in Operations Research, 66-94.

[5] Aksin, Z., M. Armony and V. Mehrotra, (2007). The Modern Call Center: A Multi-Disciplinary Perspective on Operations Management Research. Production and Operations Management, 16, 665-688.

[6] Aviv, Y. and A. Pazgal, (2005). Pricing of short life-cycle products through active learning. Working Paper.

[7] Azoury, K. S. (1985). Bayes solution to dynamic inventory models under unknown demand distribution. Management Science, 31, 1150-1160.

[8] Babaioff, M., S. Dughmi, R. Kleinberg and A. Slivkins, (2015). Dynamic pricing with limited supply. ACM Transactions on Economics and Computation (TEAC), Special Issue on EC'12, Part 1, 3.

[9] Ban, G. and C. Rudin, (2016). The Big Data Newsvendor: Practical Insights from Machine Learning. Submitted to Operations Research.

[10] Bassamboo, A. and A. Zeevi, (2009). On a Data-Driven Method for Staffing Large Call Centers. Operations Research, 57, 714-726.

[11] Belobaba, P. P. (1989). Application of a probabilistic decision model to airline seat inventory control. Operations Research, 37, 183-197.

[12] Bertsimas, D. and X. Doan, (2010). Robust and data-driven approaches to call centers. European Journal of Operational Research, 207, 1072-1085.


[13] Bertsimas, D., A. O'Hair and W. Pulleybank, (2016). The Analytics Edge. Dynamic Ideas, LLC.

[14] Bertsimas, D. and G. Perakis, (2006). Dynamic pricing: a learning approach. In Mathematical and Computational Models for Congestion Charging, Volume 101 of Applied Optimization. Springer.

[15] Bertsimas, D. and A. Thiele, (2005). A Data-Driven Approach to Newsvendor Problems. MIT Tech Report.

[16] Bertsimas, D. and A. Thiele, (2014). Robust and Data-Driven Optimization: Modern Decision Making Under Uncertainty. INFORMS Tutorials in Operations Research, 95-122.

[17] Besbes, O. and A. Muharremoglu, (2013). On implications of demand censoring in the newsvendor problem. Management Science, 59, 1407-1424.

[18] Besbes, O. and A. Zeevi, (2009). Dynamic pricing without knowing the demand function: risk bounds and near-optimal algorithms. Operations Research, 57, 1407-1420.

[19] Beutel, A.-L. and S. Minner, (2012). Safety stock planning under causal demand forecasting. Int. J. Production Economics, 140, 637-645.

[20] Bitran, G. and S. Mondschein, (1997). Periodic Pricing of Seasonal Products in Retailing. Management Science, 43, 64-79.

[21] den Boer, A. V. and B. Zwart, (2015). Dynamic Pricing and Learning with Finite Inventories. Operations Research, 63, 965-978.

[22] den Boer, A. V. (2015). Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20, 1-18.

[23] Broder, J. and P. Rusmevichientong, (2012). Dynamic pricing under a general parametric choice model. Operations Research, 60, 965-980.

[24] Brumelle, S. L. and J. I. McGill, (1993). Airline seat allocation with multiple nested fare classes. Operations Research, 41, 127-137.

[25] Burnetas, A. N. and C. E. Smith, (2000). Adaptive ordering and pricing for perishable products. Operations Research, 48, 436-443.

[26] Cayley, A. (1875). Mathematical questions with their solutions. The Educational Times, 23, 18-19. See The Collected Mathematical Papers of Arthur Cayley, 10, 587-588 (1896). Cambridge Univ. Press, Cambridge.

[27] Carvalho, A. X. and M. L. Puterman, (2003). Dynamic pricing and reinforcement learning. Proceedings of the International Joint Conference on Neural Networks.

[28] Carvalho, A. X. and M. L. Puterman, (2005). Learning and pricing in an internet environment with binomial demands. Journal of Revenue and Pricing Management, 3, 320-336.


[29] Chen, G., D. Zeng, and M. Kosorok, (2016). Personalized Dose Finding Using Outcome Weighted Learning. J. Amer. Statist. Assoc. (Theory and Methods), 111(516), 1509-1547.

[30] Chow, Y. S. and H. Robbins, (1963). On optimal stopping rules. Z. Wahrscheinlichkeitstheorie, 2, 33-49.

[31] Chu, L., G. Shanthikumar, and Z. Shen, (2008). Solving operational statistics via a Bayesian analysis. Operations Research Letters, 36, 110-116.

[32] Conrad, S. (1976). Sales data and the estimation of demand. Oper. Res. Quart., 27, 123-127.

[33] Ding, X., M. L. Puterman and A. Bisi, (2002). The censored newsvendor and the optimal acquisition of information. Oper. Res., 50, 517-527.

[34] Federgruen, A. and A. Heching, (1999). Combined Pricing and Inventory Control under Uncertainty. Operations Research, 47, 454-475.

[35] Ferguson, T. S. (1989). Who Solved the Secretary Problem? Statistical Science, 4, 282-289.

[36] Freeman, P. (1983). The Secretary Problem and Its Extensions: A Review. International Statistical Review, 51, 189-206.

[37] Gallego, G. and I. Moon, (1993). The distribution free newsboy problem: review and extensions. Journal of the Operational Research Society, 825-834.

[38] Gallego, G. and G. van Ryzin, (1994). Optimal Dynamic Pricing of Inventories with Stochastic Demand over Finite Horizons. Management Science, 40, 999-1020.

[39] Gans, N., G. Koole and A. Mandelbaum, (2003). Telephone Call Centers: Tutorial, Review, and Research Prospects. Manufacturing and Service Operations Management, 5, 79-141.

[40] Gans, N., H. Shen, Y.-Z. Zhou, N. Korolev, A. McCord and H. Ristock, (2015). Parametric Forecasting and Stochastic Programming Models for Call-Center Workforce Scheduling. Manufacturing & Service Operations Management, 17, 571-588.

[41] Garnett, O., A. Mandelbaum and M. Reiman, (2002). Designing a Call Center with Impatient Customers. Manufacturing and Service Operations Management, 4, 208-227.

[42] Gilbert, J. and F. Mosteller, (1966). Recognizing the Maximum of a Sequence. JASA, 61, 35-73.

[43] Green, L., P. Kolesar, and W. Whitt, (2007). Coping with Time-Varying Demand When Setting Staffing Requirements for a Service System. Production and Operations Management, 16, 13-39.

[44] Green, L. V., P. J. Kolesar, and J. Soares, (2001). Improving the SIPP approach for staffing service systems that have cyclic demands. Operations Research, 49, 549-564.

[45] Godfrey, G. A. and W. B. Powell, (2001). An adaptive, distribution-free algorithm for the newsvendor problem with censored demands, with applications to inventory and distribution. Management Science.


[46] Goldenshluger, A. and A. Zeevi, (2017). Optimal Stopping of a Random Sequence with Unknown Distribution. Working paper.

[47] Gusein-Zade, S. M. (1966). The problem of choice and the optimal stopping rule for a sequence of independent trials. Theory Probab. Appl., 11, 472-476.

[48] Hadley, G. and T. Whitin, (1963). Analysis of Inventory Systems. Prentice-Hall, Englewood Cliffs, NJ.

[49] Halfin, S. and W. Whitt, (1981). Heavy traffic limits for queues with many exponential servers. Operations Research, 29, 567-588.

[50] Harpaz, G., W. Y. Lee, and R. L. Winkler, (1982). Learning, Experimentation, and the Optimal Output Decisions of a Competitive Firm. Management Science, 28, 589-603.

[51] Harrison, J. M., N. B. Keskin, and A. Zeevi, (2012). Bayesian dynamic pricing policies: Learning and earning under a binary prior distribution. Management Science, 58, 570-586.

[52] Harrison, J. M. and A. Zeevi, (2005). A Method for Staffing Large Call Centers Based on Stochastic Fluid Models. Manufacturing and Service Operations Management, 7, 20-36.

[53] Huh, W. T. and P. Rusmevichientong, (2009). A nonparametric asymptotic analysis of inventory planning with censored demand. Mathematics of Operations Research, 34, 103-123.

[54] Huh, W. T., R. Levi, P. Rusmevichientong and J. Orlin, (2011). Adaptive Data-Driven Inventory Control with Censored Demand Based on Kaplan-Meier Estimator. Operations Research, 59, 929-941.

[55] James, G., D. Witten, T. Hastie and R. Tibshirani, (2013). An Introduction to Statistical Learning with Applications in R. Springer.

[56] Jemal, A., E. Ward and M. Thun, (2007). Recent trends in breast cancer incidence rates by age and tumor characteristics among U.S. women. Open Access Journal Breast Cancer Research, http://breast-cancer-research.com/content/9/3/R2.

[57] Kaplan, E. L. and P. Meier, (1958). Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc., 53, 457-481.

[58] Kennedy, D. P. and R. P. Kertz, (1991). The asymptotic behavior of the reward sequence in the optimal stopping of i.i.d. random variables. Ann. Probab., 19, 329-341.

[59] Kosorok, M. and E. Laber, (2018). Precision Medicine. Preprint.

[60] Kriesel, D. (2017). A Brief Introduction to Neural Networks (ZETA2-EN). http://www.dkriesel.com/_media/science/neuronalenetze-en-zeta2-2col-dkrieselcom.pdf

[61] Kulkarni, V. G. (2015). Modeling and Analysis of Stochastic Systems. CRC Press.

[62] Kunnumkal, S. and H. Topaloglu, (2008). Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Operations Research, 56, 646-664.


[63] Kushner, H. and G. Yin, (1997). Stochastic Approximation Algorithms and Applications. Springer-Verlag, New York.

[64] Lau, H. S. (1980). The newsboy problem under alternative optimization objectives. Journal of the Operational Research Society, 31, 525-535.

[65] Lazear, E. P., (1986). Retail Pricing and Clearance Sales. The American Economic Review, 76, 14-32.

[66] Lee, S. and M. Zelen, (1998). Scheduling periodic examinations for the early detection of disease: Applications to breast cancer. J. Amer. Statist. Assoc., 93, 1271-1281.

[67] Le Guen, T. (2008). Data driven pricing. MS Thesis, MIT.

[68] Levi, R., R. Roundy and D. Shmoys, (2007). Provably Near-Optimal Sampling-Based Policies for Stochastic Inventory Control Models. Mathematics of Operations Research, 32, 821-8.

[69] Levi, R., G. Perakis, and J. Uichanco, (2015). The Data-Driven Newsvendor Problem: New Bounds and Insights. Operations Research, 63, 1294-1306.

[70] Lin, K. Y. (2006). Dynamic pricing with real-time demand learning. European Journal of Operational Research, 174, 522-538.

[71] Lindley, D. V. (1961). Dynamic programming and decision theory. Appl. Statist., 10, 39-52.

[72] Liu, G. and H. Wang, (2013). An online sequential feed-forward network model for demand curve prediction. J. Inf. Comput. Sci., 10, 3063-3069.

[73] Liyanage, L. and G. Shanthikumar, (2005). A practical inventory control policy using operational statistics. Operations Research Letters, 33, 341-348.

[74] Lovejoy, W. S., (1990). Myopic Policies for Some Inventory Models with Uncertain Demand Distributions. Management Science, 36, 724-738.

[75] Lu, M., G. Shanthikumar, and Z. Shen, (2015). Technical Note: Operational Statistics: Properties and the Risk-Averse Case. Naval Research Logistics, 62, 206-214.

[76] Luckett, D., E. Laber, A. Kahkoska, D. Maahs, E. Mayer-Davis, and M. Kosorok, (2019). Estimating Dynamic Treatment Regimes in Mobile Health Using V-learning. arXiv:1611.03531v2.

[77] Ludkovski, M. and J. Niemi, (2010). Optimal Dynamic Policies for Influenza Management. Statistical Communications in Infectious Diseases, 2(1), Article 5.

[78] Maillart, L. M., J. S. Ivy, S. Ransom and K. Diehl, (2008). Assessing Dynamic Breast Cancer Screening Policies. Operations Research, 56, 1411-1427.

[79] Mandelbaum, A., W. A. Massey, M. I. Reiman, and R. Rider, (1999). Time-varying multiserver queues with abandonments and retrials. In P. Key and D. Smith, eds., Proceedings of the 16th International Teletraffic Congress, 355-364.


[80] Mandelbaum, A. and S. Zeltyn, (2009). Staffing Many-Server Queues with Impatient Customers: Con-straint Satisfaction in Call Centers. Operations Research, 57, 1189-1205.

[81] Massey, W. A., and W. Whitt (1999). Networks of infinite-server queues with nonstationary Poisson input. Queueing Systems, 13, 183-250.

[82] McGill, J. I. and G. J. van Ryzin, (1999). Revenue management: Research overview and prospects. Transportation Science, 33, 233-256.

[83] Metters, R., C. Queenan, M. Ferguson, L. Harrison, J. Higbie, S. Ward, B. Barfield, T. Farley, H. Kuyumcu and A. Duggasani, (2008). The "Killer Application" of Revenue Management: Harrah's Cherokee Casino & Hotel. Interfaces, 38, 161-175.

[84] Moser, L. (1956). On a problem of Cayley. Scripta Math. 22, 289-292.

[85] Oroojlooyjadid, A., L. Snyder, and M. Takac, (2017). Applying Deep Learning to the Newsvendor Problem. Working Paper.

[86] Perakis, G. and G. Roels, (2008). Regret in the newsvendor model with partial information. Operations Research, 56, 188-203.

[87] Petruccelli, J. D. (1985). Maximin optimal stopping for normally distributed random variables. Sankhya Ser. A, 47, 36-46.

[88] Powell, W., (2007). Approximate Dynamic Programming: reducing the curses of dimensionality. Wiley-Interscience.

[89] Qin, Z., and S. Kar, (2013). Single-period inventory problem under uncertain environment. Applied Mathematics and Computation, 219, 9630-9638.

[90] Quine, M., and Law, J. (1996). Exact results for a secretary problem. Journal of Applied Probability,33, 630-639.

[91] Rudin, C., and G. Vahn, (2014). The Big Data Newsvendor: Practical Insights from Machine Learning Analysis. MIT Sloan School Working Paper 5036-13.

[92] Sachs, A., and S. Minner, (2014). The data-driven newsvendor with censored demand observations. Int. J. Production Economics, 149, 28-36.

[93] Sakaguchi, M., (1961). Dynamic programming of some sequential sampling design. Journal of Mathematical Analysis and Applications, 2, 446-466.

[94] Samuels, S. M. (1981). Minimax stopping rules when the underlying distribution is uniform. J. Amer. Statist. Assoc., 76, 188-197.

[95] Sarle, W. (1994). Neural Networks and Statistical Models. Proc. Nineteenth Annual SUGI Conference, 1-13.

[96] Scarf, H. (1959). Bayes solutions of the statistical inventory problem. Ann. Math. Statist., 30, 490-508.

[97] Scarf, H., K. Arrow, S. Karlin, (1958). A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production, 10, 201-209.

[98] Shakya, S., M. Kern, G. Owusu and C.M. Chin, (2012). Neural network demand models and evolutionary optimisers for dynamic pricing. Knowl.-Based Syst., 29, 44-53.

[99] Shapiro, A., D. Dentcheva, and A. Ruszczynski, (2009). Lectures on stochastic programming: modeling and theory. SIAM.

[100] Smith, B. C., J. F. Leimkuhler and R. M. Darrow, (1992). Yield management at American Airlines. Interfaces, 22, 8-31.

[101] Stewart, T. J., (1978). Optimal Selection from a Random Sequence with Learning of the Underlying Distribution. Journal of the American Statistical Association, 73, 775-780.

[102] Subrahmanyan, S. and R. Shoemaker, (1996). Developing optimal pricing and inventory policies for retailers who face uncertain demand. J. Retail., 72, 7-30.

[103] Talluri, K. T. and G. J. van Ryzin, (2005). The theory and practice of revenue management. Springer.

[104] van Ryzin, G. and J. McGill, (2000). Revenue Management Without Forecasting or Optimization: An Adaptive Algorithm for Determining Airline Seat Protection Levels. Management Science, 46, 760-775.

[105] Yang, M. (1974). Recognizing the maximum of a random sequence based on relative rank with backward solicitation. Journal of Applied Probability, 11, 504-512.

[106] Yu, S., N. Lee, V. G. Kulkarni, H. Shen (2016). Server Allocation at Virtual Computing Labs via Queueing Models and Statistical Forecasting. Working Paper.

[107] Yue, J., B. Chen, M.-C. Wang, (2006). Expected value of distribution information for the newsvendor problem. Oper. Res., 54, 1128-1136.

[108] Wang, D., Z. Qin, and S. Kar, (2015). A novel single-period inventory problem with uncertain random demand and its application. Applied Mathematics and Computation, 269, 133-145.

[109] Watkins, C. and P. Dayan, (1992). Q-learning. Machine Learning, 8, 279-292.

[110] Zhao, Y., D. Zeng, A. Rush, and M. Kosorok, (2012). Estimating Individualized Treatment Rules Using Outcome Weighted Learning. Journal of the American Statistical Association, 107, 1106-1118.

Index

AIM Algorithm, 25

Bayesian Estimation
  Censored Demands, 21
  Observable Demands, 15

CAVE Algorithm, 24

Concave Ordering, 13

Data Driven Newsvendor, 26

Kaplan-Meier Estimator, 23

Maximum Likelihood
  Censored Demands, 21
  Observable Demands, 14

Newsvendor Formula, 10
  Exponential Demands, 10
  Uniform Demands, 10

Newsvendor Model
  Maximize Expected Profit, 9
  Maximum Utility, 13
  Mean Variance Tradeoff, 11
  Minimize Expected Cost, 10
  Target Driven, 11

Nonparametric Model
  Empirical Distribution, 20
  Maximin Criterion, 18
  Minimax Criterion, 18
  Operational Statistics, 20

Operational Statistics, 16

Stochastic Approximation, 23