
Approximation Algorithms for Restless Bandit Problems

SUDIPTO GUHA

University of Pennsylvania

AND

KAMESH MUNAGALA AND PENG SHI

Duke University

Abstract. The restless bandit problem is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit (MAB) problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-Hard to approximate to any nontrivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty.

In this article, we consider the FEEDBACK MAB problem, where the reward obtained by playing each of n independent arms varies according to an underlying on/off Markov process whose exact state is only revealed when the arm is played. The goal is to design a policy for playing the arms in order to maximize the infinite horizon time average expected reward. This problem is also an instance of a Partially Observable Markov Decision Process (POMDP), and is widely studied in wireless scheduling and unmanned aerial vehicle (UAV) routing. Unlike the stochastic MAB problem, the FEEDBACK MAB problem does not admit greedy index-based optimal policies.

We develop a novel duality-based algorithmic technique that yields a surprisingly simple and intuitive (2 + ε)-approximate greedy policy to this problem. We show that both in terms of approximation factor and computational efficiency, our policy is closely related to the Whittle index, which is widely used for its simplicity and efficiency of computation. Subsequently we define a multi-state generalization, that we term MONOTONE bandits, which remains a subclass of the restless bandit problem. We show that our policy remains a 2-approximation in this setting, and further, our technique is robust enough to incorporate various side-constraints such as blocking plays, switching costs, and even models where determining the state of an arm is a separate operation from playing it.

Our technique is also of independent interest for other restless bandit problems, and we provide an example in nonpreemptive machine replenishment. Interestingly, in this case, our policy provides a constant factor guarantee, whereas the Whittle index is provably polynomially worse.

This article combines and generalizes results presented in two papers [Guha and Munagala 2007b; Guha et al. 2009] that appeared in the FOCS '07 and SODA '09 conferences, respectively. S. Guha was supported by an NSF CAREER award and an Alfred P. Sloan fellowship. K. Munagala and P. Shi were supported by NSF awards CNS-0540347 and CCF-0745761. Authors' addresses: S. Guha, University of Pennsylvania, Levine Hall, 3451 Walnut Street, Philadelphia, PA 19104, e-mail: [email protected]; K. Munagala, D205, Levine Science Research Center, Research Drive, Duke University, Durham, NC 27708, e-mail: [email protected]; P. Shi, Duke University, PO Box 99857, Durham, NC 27708, e-mail: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © 2010 ACM 0004-5411/2010/12-ART3 $10.00 DOI 10.1145/1870103.1870106 http://doi.acm.org/10.1145/1870103.1870106


By presenting the first O(1) approximations for nontrivial instances of restless bandits as well as of POMDPs, our work initiates the study of approximation algorithms in both these contexts.

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems; G.3 [Probability and Statistics]

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Approximation algorithms, stochastic control, multi-armed bandits, restless bandits

ACM Reference Format:

Guha, S., Munagala, K., and Shi, P. 2010. Approximation algorithms for restless bandit problems. J. ACM 58, 1, Article 3 (December 2010), 50 pages. DOI = 10.1145/1870103.1870106 http://doi.acm.org/10.1145/1870103.1870106

1. Introduction

The celebrated multi-armed bandit problem (MAB) models the central trade-off in decision theory between exploration and exploitation, or in other words between learning about the state of a system and utilizing the system. In this problem, there are n competing options, referred to as "arms," yielding unknown rewards {r_i}. Playing an arm yields a reward drawn from an underlying distribution, and the information from the reward observed partially resolves its distribution. The goal is to sequentially play the arms in order to maximize the reward obtained over some time horizon. Typically, the multi-armed bandit problem is studied under one of two assumptions:

(1) The underlying reward distribution for each arm is fixed but unknown (stochastic multi-armed bandits). In this case, either the reward distribution is assumed to be completely unknown [Auer et al. 2002; Audibert and Bubeck 2009], or the distribution is assumed to be parametrized and a prior over the parameter is specified as input [Arrow et al. 1949; Robbins 1952; Wald 1947].

(2) The underlying rewards can vary with time in an adversarial fashion, and the comparison is against an optimal strategy that always plays one arm, albeit with the benefit of hindsight (adversarial multi-armed bandits [Abernethy et al. 2008; Auer et al. 2003; Cesa-Bianchi et al. 1997; Flaxman et al. 2005]).

When we relax both assumptions simultaneously, we encounter the problem where the rewards can vary stochastically with time, and the comparison is against an optimal strategy that is allowed to change arms at will. This leads to the notorious restless bandit problem in decision theory, which in its ultimate generality is PSPACE-hard to even approximate [Papadimitriou and Tsitsiklis 1999]. In the last two decades, in spite of the growth of approximation algorithms and the numerous applications of restless bandits [Ansell et al. 2003; Glazebrook and Mitchell 2002; Glazebrook et al. 2005, 2006; Liu and Zhao 2008; Ny et al. 2008; Bertsimas and Nino-Mora 2000; Weber and Weiss 1990; Whittle 1988], the approximability of these problems has remained unexplored. In this article, we provide a general algorithmic technique that yields the first O(1) approximations to a large class of these problems that are commonly studied in practice.

An important subclass of restless bandit problems consists of situations where the system is agnostic of the exploration; that is, the exploration gives us feedback about the state of the system but does not interfere with the evolution of the system. One such


problem is the FEEDBACK MAB, which models opportunistic multi-channel access at a wireless node [Guha and Munagala 2007b; Ahmad et al. 2008; Zhao et al. 2007]: The bandit corresponds to a wireless node with access to multiple noisy channels (arms). The state of the arm is the state (good/bad) of the channel, which varies according to a bursty 2-state Markov process. Playing the arm corresponds to transmitting on the channel, yielding reward if the transmission is successful (good channel state), and at the same time revealing to the transmitter the current state of the channel. This corresponds to the Gilbert-Elliot model [Kodialam and Lakshman 2007] of channel evolution. The goal is to find a transmission policy of choosing one channel to transmit on every time step that maximizes the long-term transmission rate. FEEDBACK MAB also models Unmanned Aerial Vehicle (UAV) routing [Ny et al. 2008]: the arms are locations of possibly interesting events, and whether a location is interesting or uninteresting follows a 2-state Markov process. Visiting a location by the UAV corresponds to playing the arm, and yields reward if an interesting event is detected. The goal is to find a routing policy that maximizes the long-term average reward from interesting events.

This problem is also a special case of Partially Observable Markov Decision Processes or POMDPs [Kaelbling et al. 1998; Smallwood and Sondik 1971; Sondik 1978]. The state of each arm evolves according to a Markov chain whose state is only observed when the arm is played. The player's partial information, encapsulated by the last observed state and the number of steps since last playing, yields a belief on the current state. (This belief is simply a probability distribution for the arm being good or bad.) The player uses this partial information in making the decision about which arm to play next, which in turn affects the information at future times. While such POMDPs are widely used in control theory, they are in general notoriously intractable [Bertsekas 2001; Kaelbling et al. 1998]. In this article, we provide the first O(1) approximation for the FEEDBACK MAB and a number of its important extensions. This represents the first approximation guarantee for a POMDP, and the first guarantee for a MAB problem with time-varying rewards that compares to an optimal solution allowed to switch arms at will.[1]

[1] Though some previous work in adversarial MAB literature considers policies with restricted switching between arms [Auer 2002; Zinkevich 2003], our work is among the first to consider arbitrary and unlimited switching.

Before we present the problem statements formally, we survey literature on the stochastic multi-armed bandit problem. (We discuss adversarial MAB after we present our model and results.)

1.1. BACKGROUND: STOCHASTIC MAB AND RESTLESS BANDITS. The stochastic MAB was first formulated by Arrow et al. [1949] and Robbins [1952]. It resides under a Bayesian (or decision theoretic) setting: we successively choose between several options given some prior information (specified by distributions), and our beliefs are updated via Bayes' rule conditioned on the results of our choices (observed rewards).

More formally, we are given a "bandit" with n independent arms. Each arm i can be in one of several states belonging to the set S_i. At any time step, the player can play one arm. If arm i in state k ∈ S_i is played, it transitions in a Markovian fashion to state j ∈ S_i with probability q^i_{kj}, and yields reward r^i_k ≥ 0. The states of arms that are not played stay the same. The initial state models the prior knowledge about the arm. The states in general capture the posterior conditioned on the observations from sequential plays. The task is, given the initial states of the arms, to find a policy for playing the arms in order to maximize one of the following infinite horizon quantities:

    ∑_{t≥0} β^t R_t   (discounted reward),    or    lim_{T→∞} (1/T) ∑_{t<T} R_t   (average reward),

where R_t is the expected reward of the policy at time step t and β ∈ (0, 1) is a discount factor. A policy is a (possibly implicit) specification of fixing up front which arm (or distribution over arms) to play for every possible joint state of the arms.

It is well known that Bellman's equations [Bertsekas 2001] yield the optimal policy by dynamic programming. The main issue in the stochastic setting is in efficiently computing and succinctly specifying the optimal policy: The input to an algorithm specifies the rewards and transition probabilities for each arm, and thus has size linear in n, but the state space is exponential in n. We seek polynomial-time algorithms (in terms of the input size) that compute (near-)optimal policies with poly-size specifications. Moreover, we require the policies to be executable in polynomial time at each step.

Note that since a policy is a fixed (possibly randomized) mapping from the exponential-size joint state space to a set of actions, ensuring polynomial time computation and execution often requires simplifying the description of the optimal policy using the problem structure. The stochastic MAB problem is the most well-known decision problem for which such a structure is known: The optimal policy is a greedy policy termed the GITTINS index policy [Gittins and Jones 1972; Tsitsiklis 1994; Bertsekas 2001]. In general, an index policy specifies a single number called an "index" for each state k ∈ S_i of each arm i, and at every time step plays the arm whose current state has the highest index. Index policies are desirable since they can be compactly represented, so they are the heuristic method of choice for several MDP problems. In addition, index policies are also optimal for several generalizations of the stochastic MAB, such as arm-acquiring bandits [Whittle 1981] and branching bandits [Weiss 1988]. In fact, a general characterization of problems for which index policies are optimal is now known [Bertsimas and Nino-Mora 1996].
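To make the notion of an index policy concrete, the following is a minimal sketch; the function names and structure are ours, not the article's. Given a precomputed index for every per-arm state, each step the policy plays the arm whose current state has the highest index, and (as in the stochastic MAB) only the played arm changes state.

```python
from typing import Callable, Hashable, List

def run_index_policy(
    states: List[Hashable],                    # current state of each arm
    index: Callable[[int, Hashable], float],   # index(arm, state) -> priority value
    play: Callable[[int], Hashable],           # play the arm, return its new state
    horizon: int,
) -> None:
    """Generic index policy: at every step, play the arm whose state has the highest index.
    Arms that are not played keep their current state."""
    for _ in range(horizon):
        chosen = max(range(len(states)), key=lambda i: index(i, states[i]))
        states[chosen] = play(chosen)
```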

Restless Bandits. In the stochastic MAB problem, the underlying reward distributions for each arm are fixed but unknown. However, if the rewards can vary with time, the problem stops admitting optimal index policies or efficient solutions. The problem now needs to be modeled as a restless bandit problem, first proposed by Whittle [1988]. The problem statement of the restless bandits is similar to stochastic MAB, except that when arm i in state k ∈ S_i is not played, its state evolves to j ∈ S_i with probability q̃^i_{kj}. Therefore, the state of each arm varies according to an active transition matrix q when the arm is played, and according to a passive transition matrix q̃ if the arm is not played. The restless bandit problem has typically been studied in the infinite horizon average reward setting, and this is the setting in which we will study the problem in this article. It is relatively straightforward to show that no index policy can be optimal for these problems; in fact, Papadimitriou and Tsitsiklis [1999] show that for n arms, even when all q and q̃ values are either 0 or 1 (deterministic transitions), computing the optimal policy is a PSPACE-hard problem. Their proof in fact shows that deciding if the optimal reward is non-zero is also PSPACE-hard, hence ruling out any approximation algorithm as well.


On the positive side, Whittle [1988] presents a poly-size LP relaxation of the problem. In this relaxation, the constraint that exactly one arm is played per time step is replaced by the constraint that one arm on average is played per time step. In the LP, this is the only constraint connecting the arms. (Such decision problems have been termed weakly coupled systems [Hawkins 2003; Adelman and Mersereau 2008].) Based on the Lagrangian of this relaxation, Whittle [1988] defines a heuristic index that generalizes the Gittins index. This is termed the Whittle Index (see Section 3). Though this index is widely used in practice and has excellent empirical performance [Ansell et al. 2003; Glazebrook and Mitchell 2002; Glazebrook et al. 2005, 2006; Liu and Zhao 2008; Ny et al. 2008; Weber and Weiss 1990], the known theoretical guarantees [Weber and Weiss 1990; Glazebrook and Mitchell 2002] are very weak. In summary, despite being very well-motivated and extensively studied, there are almost no positive results on approximation guarantees for restless bandit problems.

1.2. RESULTS AND ROADMAP. Our main contribution is developing a novel duality-based algorithmic technique that yields surprisingly simple and intuitive index policies for a large class of restless bandit problems. In each case, the approximation ratio of the policy we obtain is very small. Our algorithmic technique involves solving (in polynomial time) the Lagrangian of Whittle's LP relaxation for a suitable (and subtle) "balanced" choice of the Lagrange multiplier, converting this into a feasible index policy, and using an amortized accounting of the reward for the analysis. Our technique also yields the first analysis of the well-known Whittle index widely used in these contexts. We explain our results in detail below, grouping them by the type of problem and explaining the rationale for considering them.

FEEDBACK MAB. As discussed before, this problem has received a lot of attention from a number of different communities in recent years [Guha and Munagala 2007b; Ahmad et al. 2008; Ny et al. 2008; Zhao et al. 2007], including several practical settings. We show (in Section 2) that our algorithmic technique yields a (2 + ε)-approximate index policy. We also provide an e/(e − 1) integrality gap instance for this relaxation, showing that our analysis is nearly tight.

We next show (in Section 3) that for FEEDBACK MAB, our duality-based technique is closely related to the widely used Whittle index [Whittle 1988; Ny et al. 2008; Ahmad et al. 2008]. In fact, we provide the first constant factor approximation analysis of a minor variant of this index. The variation is a simple thresholding that biases us towards continuing to play an arm if the rewards are sufficient. Therefore, although the Whittle index is not optimal, our result sheds light on its observed superior performance in this specific context.

We also show instances where the reward of any index policy is at least a factor of 1 + Ω(1) smaller than the reward of the optimal policy, so that our results are tight up to small constant factors. We believe that our analysis will provide a useful template for the analysis of index policies in related contexts.

MONOTONE bandits. The FEEDBACK MAB problem is defined on two-state Markov processes. In Section 4, we ask the question: What are the key properties of the stochastic system in FEEDBACK MAB that facilitate our duality-based analysis, and can we extend them to general multi-state processes? We make partial progress towards this goal by extracting two crucial properties: Separability and Monotonicity. We use these to define an abstract class of restless bandit problems


that we term MONOTONE bandits. This class is a multi-state generalization of FEEDBACK MAB, with the property that the time-dependent transitions can be factored into a multiplicative model. We provide a 2-approximation for this class in Section 4 by generalizing the technique in Section 2. Our technique now introduces a balance constraint in the dual of the natural LP relaxation, and constructs the index policy from the optimal dual solution. We further show that the separability and monotonicity properties are crucial. In the absence of the former, the problem is NP-Hard to approximate, and in the absence of the latter, it has an unbounded integrality gap.

The advantage of this abstraction is realized by the fact that the analysis can be extended to incorporate side-constraints. In Section 5, we show that our proofs go through with only minor modifications when we incorporate multiple simultaneous plays of varying durations, as well as costs for switching between arms that are subtracted from the reward accrued. These extensions demonstrate the broad applicability of the ideas, and reaffirm the notion that these ideas form a coherent overarching technique.

Though the FEEDBACK MAB is a special case of MONOTONE bandits, we choose to give a self-contained exposition of the former problem for several reasons: The FEEDBACK MAB problem has been extensively studied in its own right in several contexts and across different communities [Guha and Munagala 2007b; Ahmad et al. 2008; Ny et al. 2008; Zhao et al. 2007]. This problem is simple to state and intuitive, even though the natural LP formulation requires infinitely many constraints. Given this, it is not immediately clear how to solve Whittle's LP efficiently, and a large part of Section 2 is devoted to this issue. Note that the behavior of the Lagrangian relaxations is nonobvious, as illustrated in Figure 14. Finally, our analysis of Whittle's index in Section 3, which is built on the discussion in Section 2, is specific to FEEDBACK MAB and uses a geometric intuition, as evident in Lemma 3.1 and Claim 3.4. The analogues of these results remain open for MONOTONE bandits.

FEEDBACK MAB with Observation Costs. In the preceding discussion, we assumed that the feedback is automatic: this is not true in many settings, especially in wireless networks, where the state observation can only be performed with explicit intent and consumes resources such as energy. This aspect sets this problem apart from the passive-acquisition based problems earlier, and brings it a step closer to the most general restless bandit framework, by allowing an active decision to "buy" information or probe an arm even when it is not being played. In fact, we can even consider the setting where the play action does not reveal the state of the arm. Observe that since the optimum policy is also allowed this option, the optimum policy can have significantly higher reward than one just relying on feedback. From a technical perspective, we now have three actions for each arm (play, do not play, and probe) instead of the standard two actions: this makes the Whittle index, as well as the relaxations discussed heretofore, not relevant for this problem. We discuss this problem in Section 6, and yet again demonstrate the strength of the overall technique by showing that for an appropriately written LP relaxation, our duality-based technique yields a (3 + ε)-approximation.

Nonpreemptive Machine Replenishment. Finally, in Section 7, we derive a 2-approximation for a classic restless bandit problem called nonpreemptive machine replenishment [Bertsekas 2001; Goseva-Popstojanova and Trivedi 2000; Munagala and Shi 2008]. This problem does not fall in the MONOTONE bandit framework.


This is rendered concrete by showing that the Whittle index for this problem has an Ω(n) gap against the optimal policy. Our duality-based technique provides a constant factor approximation for this problem. Once again, this showcases the flexibility of our technique in handling a wide variety of restless bandit problems.

1.2.1. Comparison to the Whittle Index. At this point, it would be illustrative to highlight the key difference between Whittle's index and our index policy. The former chooses one Lagrange multiplier (or index) per state of each arm, with the policy playing the arm with the largest index. This has the advantage of separate efficient computations for different arms; and in addition, such a policy (the Gittins index policy [Gittins and Jones 1972]) is known to be optimal for the stochastic MAB. However, it is well known [Asawa and Teneketzis 1996; Banks and Sundaram 1994; Brezzi and Lai 2002] that this intuition about playing the arm with the largest index being optimal becomes increasingly invalid when complicated side-constraints such as time-varying rewards (FEEDBACK MAB), blocking plays, and switching costs are introduced. In fact, for the machine replenishment problem in Section 7, the Whittle index has an Ω(n) gap.

In contrast to the Whittle index, our technique chooses a single global Lagrange multiplier via a careful accounting of the reward, and develops a feasible policy from it. Unlike the Whittle index, this technique is sufficiently robust to encompass a large number of side-constraints and variants (Sections 5 and 6), and can provide O(1) approximations even when Whittle's index is polynomially suboptimal (Section 7). Finally, since our technique is based on solving the Lagrangian[2] (just like the Whittle index), the computation time is comparable to that for such indices.

[2] This aspect is explicit in Sections 2 and 3. In Sections 4–7, we present our algorithm in terms of first solving a linear program; however, it is easy to see that this is equivalent to solving the Lagrangian, and hence to the computation required for Whittle's index. The details are quite standard and can be reconstructed from those in Sections 2 and 3.

In summary, our technique succeeds in finding the first provably approximate policies for widely studied control problems, without sacrificing efficiency in the process. We believe that the generality of this technique will be useful for exploring other useful variations of these problems as well as providing an alternate algorithm for practitioners.

1.3. RELATED WORK

Contrast with the Adversarial MAB Problem. While our problem formulations are based on the stochastic MAB problem, one might be interested in a formulation based on the adversarial MAB [Auer et al. 2003]. Such a formulation might be to assume that rewards can vary adversarially, or that the reward distribution is adversarially chosen, and that the objective is to compete with a restricted optimal solution that always plays the same arm but with the benefit of hindsight (or with foreknowledge of the true reward distribution).

These formulations result in fundamentally different problems. In our formulation, the difficulty is computational: we want to compute policies for playing the arms, assuming stochastic models of how the system varies with time. Under the adversarial formulation, the difficulty is informational: we would be interested in the regret of not having the benefit of hindsight (or of not having the benefit of more information about the reward sequence). A series of papers shows near-tight regret bounds in fairly general settings [Abernethy et al. 2008; Audibert and Bubeck 2009; Auer 2002; Auer et al. 2002, 2003; Cesa-Bianchi et al. 1997; Flaxman et al. 2005; Lai and Robbins 1985; Littlestone and Warmuth 1994]. However, applying this framework is not satisfying in our context: It is straightforward to show that a policy for FEEDBACK MAB that is allowed to switch arms can be Ω(n) times better than a policy that is not allowed to do so (even assuming hindsight). Another approach would be to define each policy as an "expert", and use the low-regret experts algorithm [Cesa-Bianchi et al. 1997]; however, the number of policies is super-exponentially large, which would lead to weak regret bounds, along with exponential-size policy descriptions and exponential per-step execution time.

We note that developing regret bounds in the presence of changing environments has received significant interest recently in computational learning [Auer 2002; Auer et al. 2003; de Farias and Megiddo 2006; Kakade and Kearns 2005; Slivkins and Upfal 2008; Zinkevich 2003]; however, this direction requires assumptions such as bounded switching between arms [Auer 2002; Zinkevich 2003] and slowly varying environments [Kakade and Kearns 2005; Slivkins and Upfal 2008], both of which are inapplicable to FEEDBACK MAB. In independent work, Slivkins and Upfal [2008] consider the modification of FEEDBACK MAB where the underlying state of each arm varies according to a reflected Brownian motion with bounded variance. As discussed in Slivkins and Upfal [2008], their problem is technically different from ours, and requires different performance metrics.

Other Related Work. The results in Guha and Munagala [2007a, 2007c], Goel et al. [2006], and Guha et al. [2008] consider variants of the stochastic MAB where the underlying reward distribution does not change. Although several of these results use LP rounding, they are significantly different because only a limited time is allotted to learning about the environment, and the recurrent behavior we analyze here is not relevant.

We show a 2-approximation for nonpreemptive machine replenishment (Section 7). Elsewhere, Munagala and Shi [2008] considered the special case of the preemptive machine replenishment problem, for which the Whittle index is equivalent to a simple greedy scheme. They show that this greedy policy, though not optimal, is a 1.51-approximation. However, the techniques there are based on queuing analysis, and do not extend to the nonpreemptive case, where the Whittle index can be an arbitrarily poor approximation (as shown in Section 7).

2. The FEEDBACK MAB Problem

In this problem, first formulated independently in Guha and Munagala [2007b], Zhao et al. [2007], Ahmad et al. [2008], and Ny et al. [2008], there is a bandit with n independent arms. Arm i has two states: The good state g_i yields reward r_i, and the bad state b_i yields no reward. The evolution of the state of the arm follows a bursty 2-state Markov process (see the assumption below) which does not depend on whether the arm is played or not at a time slot. Let s_{it} denote the state of arm i at time t. Denote the transition probabilities of the Markov chain as follows: Pr[s_{i(t+1)} = g_i | s_{it} = b_i] = α_i and Pr[s_{i(t+1)} = b_i | s_{it} = g_i] = β_i. The α_i, β_i, r_i values are specified as input.

We will assume the Markov chains are bursty, so that α_i + β_i ≤ 1 − δ for some small δ > 0 specified as part of the input. This assumption ensures that if an arm is not played, the probability that the state of the arm remains the same is monotonically decreasing with time. All results in this section crucially hinge on this assumption.

The states of different arms evolve independently. Any policy chooses at most one arm to play every time slot. Each play is of unit duration, yields reward depending on the state of the arm, and reveals to the policy the current state of that arm. When an arm is not played, the true underlying state cannot be observed, which makes the problem a POMDP. The goal is to find a policy to play the arms in order to maximize the infinite horizon average reward.

First observe that since we are considering infinite horizon average reward, we can assume that in any policy, each arm is played at least once without changing the average reward of the policy. We can next change the reward structure so that when an arm is played, we obtain reward from the last-observed state instead of the currently observed state. This does not change the average reward of any policy. Therefore, from the perspective of any policy, the state of any arm can be encoded as (s, t), which denotes that it was last observed t ≥ 1 steps ago to be in state s ∈ {g_i, b_i}.
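The following is a minimal simulation sketch of a single arm and its policy-side state (s, t); the class and method names are ours and are only meant to make the bookkeeping above concrete (reward here is taken from the currently observed state, i.e., the original reward structure).

```python
import random

class FeedbackArm:
    """One FEEDBACK MAB arm: a hidden 2-state Markov chain (good/bad) with
    Pr[bad -> good] = alpha and Pr[good -> bad] = beta per step.
    The policy only learns the true state when it plays the arm."""

    def __init__(self, r, alpha, beta, seed=0):
        self.r, self.alpha, self.beta = r, alpha, beta
        self.rng = random.Random(seed)
        self.hidden = 'g'                 # true underlying state (not visible to the policy)
        self.last_obs, self.age = 'g', 1  # policy-side state (s, t): observed s, t steps ago

    def play(self):
        """Observe the current state, collect reward if good, and reset (s, t);
        evolve() is called on every arm afterwards, so t becomes 1 on the next step."""
        reward = self.r if self.hidden == 'g' else 0.0
        self.last_obs, self.age = self.hidden, 0
        return reward

    def evolve(self):
        """One step of the underlying chain; happens whether or not the arm was played."""
        flip = self.alpha if self.hidden == 'b' else self.beta
        if self.rng.random() < flip:
            self.hidden = 'g' if self.hidden == 'b' else 'b'
        self.age += 1
```

A driver loop would, at each step, pick one arm based on the (last_obs, age) pairs, call play() on it, and then call evolve() on every arm.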

Note that any policy maps each possible joint state of the n arms into an action of which arm to play. Such a mapping has size exponential in n. The standard heuristic is to consider index policies: policies that define an "index" or number for each state (s, t) of each arm i and play the arm with the highest current index. The following theorem shows that playing the arm with the highest expected next-step reward (the myopic index policy) does not work, and that index policies in general are nonoptimal. Therefore, the best we can hope for with index policies is an O(1) approximation.

THEOREM 2.1 (PROVED IN APPENDIX A). For FEEDBACK MAB, the reward of the optimal policy has an Ω(n) gap against that of the myopic index policy and a 1 + Ω(1) gap against that of the optimal index policy.

Roadmap. In this section, we show that a simple index policy is a (2 + ε)-approximation. This is based on a natural LP relaxation suggested by Whittle, which we discuss in Section 2.1; this formulation will have infinitely many constraints. We then consider the Lagrangian of this formulation in Section 2.2, and analyze its structure via duality, which enables computing its optimal solution in polynomial time. At this point, we deviate significantly from previous literature, and present our main contribution in Section 2.3: a novel "balanced" choice of the Lagrange multiplier, which enables the design of an intuitive index policy, BALANCEDINDEX, along with an equally intuitive analysis. We use duality and potential function arguments to show that the policy is a (2 + ε)-approximation. We conclude by showing that the gap of Whittle's relaxation is e/(e − 1) ≈ 1.58, indicating that our analysis is reasonably tight. This analysis technique generalizes easily (explored in Sections 4–7) and has rich connections to other index policies, most notably the Whittle index (explored in Section 3).

2.1. WHITTLE'S LP. Whittle's LP is obtained by effectively replacing the hard constraint of playing one arm per time step with allowing multiple plays per step but requiring one play per step on average. Hence, the LP is a relaxation of the optimal policy.

Definition 1. Let v_{it} be the probability of arm i being in state g_i when it was last observed in state b_i exactly t steps ago. Let u_{it} be the same probability when the last observed state was g_i. We have:

    v_{it} = (α_i / (α_i + β_i)) · (1 − (1 − α_i − β_i)^t)    and    u_{it} = α_i / (α_i + β_i) + (β_i / (α_i + β_i)) · (1 − α_i − β_i)^t

The following fact follows from the burstiness assumption α_i + β_i < 1 for all i.

FACT 2.2. The functions v_{it} and 1 − u_{it} are monotonically increasing and concave functions of t.
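As a quick sanity check of Definition 1 and Fact 2.2, here is a small sketch (the function names are ours) that evaluates the closed forms for v_{it} and u_{it} and numerically verifies the monotonicity claim for one bursty parameter setting.

```python
def v(alpha, beta, t):
    """v_it: P(arm is good now | it was observed bad t steps ago)."""
    return alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)

def u(alpha, beta, t):
    """u_it: P(arm is good now | it was observed good t steps ago)."""
    return alpha / (alpha + beta) + beta / (alpha + beta) * (1.0 - alpha - beta) ** t

# One bursty setting (alpha + beta < 1): v_it and 1 - u_it should both increase with t.
alpha, beta = 0.1, 0.2
vs = [v(alpha, beta, t) for t in range(1, 30)]
ws = [1.0 - u(alpha, beta, t) for t in range(1, 30)]
assert all(a <= b for a, b in zip(vs, vs[1:]))
assert all(a <= b for a, b in zip(ws, ws[1:]))
```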

We now present Whittle’s LP, and interpret it in the lemma that follows.

Maximize  ∑_{i=1}^n ∑_{t≥1} r_i x^i_{gt}

subject to:

    ∑_{i=1}^n ∑_{s∈{g,b}} ∑_{t≥1} x^i_{st} ≤ 1
    ∑_{t≥1} ∑_{s∈{g,b}} (x^i_{st} + y^i_{st}) ≤ 1                                        ∀i
    x^i_{st} + y^i_{st} = y^i_{s(t−1)}                                                   ∀i, s ∈ {g, b}, t ≥ 2
    x^i_{g1} + y^i_{g1} = ∑_{t≥1} v_{it} x^i_{bt} + ∑_{t≥1} u_{it} x^i_{gt}               ∀i
    x^i_{b1} + y^i_{b1} = ∑_{t≥1} (1 − v_{it}) x^i_{bt} + ∑_{t≥1} (1 − u_{it}) x^i_{gt}   ∀i
    x^i_{st}, y^i_{st} ≥ 0                                                               ∀i, s ∈ {g, b}, t ≥ 1

LEMMA 2.3. The optimal objective of Whittle's LP, OPT, is at least the value of the optimal policy.

PROOF. Consider the optimal policy. In the execution of this policy, for each arm i and state (s, t) for s ∈ {g, b}, let the variable x^i_{st} denote the probability (or fraction of time steps) of the event: arm i is in state (s, t) and gets played. Let y^i_{st} correspond to the probability of the event that the state is (s, t) and the arm is not played. Since the underlying Markov chains are ergodic, the optimal policy when executed is ergodic, and the above probabilities are well-defined.

Now, at any time step, some arm i in state (s, t) is played, which implies the x^i_{st} values are probabilities of mutually exclusive events. This implies they satisfy the first constraint in the LP. Similarly, for each arm i, at any step, this arm is in some state (s, t) and is either played or not played, so that the x^i_{st}, y^i_{st} correspond to mutually exclusive events. This implies that for each i, they satisfy the second constraint. For any arm i and state (s, t), the left-hand side of the third constraint is the probability of being in this state, while the right-hand side is the probability of entering this state; these are clearly identical in the steady state. For arm i, the left-hand side of the fourth (respectively fifth) constraint is the probability of being in state (g, 1) (respectively (b, 1)), and the right-hand side is the probability of entering this state; again, these are identical.

This shows that the probability values defined for the execution of the optimal policy are feasible for the constraints of the LP. The value of the optimal policy is precisely ∑_{i=1}^n ∑_{t≥1} r_i x^i_{gt}, which is at most OPT, the maximum possible objective for the LP.

FIG. 1. The linear program (WHITTLE) for the FEEDBACK MAB problem.

This LP encodes in one variable x^i_{st} the probability that arm i is in state (s, t) and gets played; however, we note that in the optimal policy, this decision to play actually depends on the joint state of all arms. This separation of the joint probabilities into individual probabilities effectively relaxes the condition of having one play per step to allowing multiple plays per step while requiring one play per step on average. While the optimal solution to Whittle's LP does not correspond to any feasible policy, the relaxation allows us to compute an upper bound on the value of the optimal feasible policy.

We note that y^i_{st} = ∑_{t′>t} x^i_{st′}. It is convenient to eliminate the variables y^i_{st} by substitution, after which the last two constraints collapse into the same constraint. Thus, we have the natural LP formulation shown in Figure 1. We note that the first constraint can either be an inequality (≤) or an equality; without loss of generality, we use equality, since we can add a dummy arm that does not yield any reward on playing.

From now on, let OPT denote the value of the optimal solution to (WHITTLE). The LP in its current form has infinitely many constraints; we will now show that this LP can be solved in polynomial time to arbitrary precision by finding structure in the Lagrangian.
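Before turning to the Lagrangian, a brute-force sketch may help make the relaxation concrete: the code below solves a finite truncation (t ≤ T) of Whittle's LP with scipy.optimize.linprog. The truncation, the variable layout, and the function name are ours; this is only meant for small instances and is not the polynomial-time route developed next.

```python
import numpy as np
from scipy.optimize import linprog

def whittle_lp_value(r, alpha, beta, T=50):
    """Value of a truncated (t <= T) version of Whittle's LP; an upper bound
    (up to truncation error) on the optimal infinite-horizon average reward."""
    n = len(r)
    v = lambda i, t: alpha[i] / (alpha[i] + beta[i]) * (1 - (1 - alpha[i] - beta[i]) ** t)
    u = lambda i, t: alpha[i] / (alpha[i] + beta[i]) + beta[i] / (alpha[i] + beta[i]) * (1 - alpha[i] - beta[i]) ** t

    # Variables x[i, s, t] with s in {0: good, 1: bad}, flattened into one vector.
    idx = lambda i, s, t: (i * 2 + s) * T + (t - 1)
    nvar = n * 2 * T

    c = np.zeros(nvar)                       # linprog minimizes, so negate the reward
    for i in range(n):
        for t in range(1, T + 1):
            c[idx(i, 0, t)] = -r[i]

    A_eq = [np.ones(nvar)]                   # one play per step on average
    b_eq = [1.0]
    for i in range(n):                       # steady-state balance constraint for each arm
        row = np.zeros(nvar)
        for t in range(1, T + 1):
            row[idx(i, 1, t)] = v(i, t)
            row[idx(i, 0, t)] = -(1 - u(i, t))
        A_eq.append(row); b_eq.append(0.0)

    A_ub, b_ub = [], []
    for i in range(n):                       # each arm occupies one state per step
        row = np.zeros(nvar)
        for s in (0, 1):
            for t in range(1, T + 1):
                row[idx(i, s, t)] = t
        A_ub.append(row); b_ub.append(1.0)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq))
    return -res.fun

print(whittle_lp_value(r=[1.0, 0.5], alpha=[0.1, 0.2], beta=[0.2, 0.1]))
```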

2.2. DECOUPLING ARMS VIA THE LAGRANGIAN. In (WHITTLE), the only constraint connecting different arms is the constraint:

    ∑_{i=1}^n ∑_{s∈{g,b}} ∑_{t≥1} x^i_{st} = 1

We absorb this constraint into the objective via a Lagrange multiplier λ ≥ 0 to obtain the following objective:

    Maximize  λ + G(λ) ≡ λ + ∑_{i=1}^n ∑_{t≥1} ( r_i x^i_{gt} − λ (x^i_{gt} + x^i_{bt}) )        (LPLAGRANGE(λ))

subject to:

    ∑_{t≥1} ∑_{s∈{g,b}} t · x^i_{st} ≤ 1                        ∀i
    ∑_{t≥1} v_{it} x^i_{bt} = ∑_{t≥1} (1 − u_{it}) x^i_{gt}      ∀i
    x^i_{st} ≥ 0                                                 ∀i, s ∈ {g, b}, t

Through the Lagrangian, we have removed the only constraint that connected multiple arms. It is well known that for any λ ≥ 0, we have λ + G(λ) ≥ OPT, a fact that we will use later.

LPLAGRANGE(λ) yields n disjoint maximization problems, one for each arm i. At any time step, arm i can be played (and reward obtained from it), or not played. Whenever the arm is played, we incur a penalty λ in addition to the reward. The goal is to maximize the expected reward minus penalty. Note that if the penalty is zero, the arm is played every step, and if the penalty is sufficiently large, the optimal solution would be to never play the arm. It is immediate (see, e.g., Bertsekas [2001]) that for any arm i, the optimal solution to LPLAGRANGE(λ) for arm i exactly encodes the optimal policy for the above maximization problem, by interpreting x^i_{st} as the probability that the policy plays the arm in state (s, t).

FIG. 2. The Policy P_i(t).

Definition 2. Let L_i(λ) denote the optimal policy for LPLAGRANGE(λ) restricted to arm i, and let H_i(λ) denote its expected reward minus penalty.

Note that the total reward minus penalty of LPLAGRANGE(λ) is the sum over arms: G(λ) = ∑_{i=1}^n H_i(λ).

2.2.1. Characterizing the Optimal Single-Arm Policy. An optimal policy L_i(λ) would have the following form: If the arm is observed to be in (g, 1), play the arm in state (g, t1); if it is observed to be in (b, 1), play the arm in state (b, t2). The fact that an optimal policy will have deterministic actions for each state follows from the theory of dynamic programming (see, e.g., Bertsekas [2001]). We first show that the optimal policy L_i(λ) for any arm i belongs to the class of policies P_i(t) for t ≥ 1, whose specification is presented in Figure 2. Intuitively, Step (1) corresponds to exploitation, and Step (2) to exploration. Set P_i(∞) to be the policy that never plays the arm. We present closed-form expressions for the reward and penalty of this policy in Lemma A.1; these expressions will be used later, in the proofs of Lemma 2.5 and Lemma 2.6.
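Figure 2 itself is not reproduced in this text, but its content can be recovered from the discussion here and in the proof of Lemma 2.4: P_i(t) keeps playing while the arm was just seen in the good state, and after a bad observation waits t steps before playing again. A minimal sketch (the function name is ours):

```python
def policy_P(t_wait, arm_state):
    """Single-arm policy P_i(t_wait): decide whether to play the arm this step.
    arm_state is (s, t): last observed state s in {'g', 'b'}, seen t >= 1 steps ago."""
    s, t = arm_state
    if s == 'g' and t == 1:       # Step (1), exploitation: just observed good, keep playing
        return True
    if s == 'b' and t >= t_wait:  # Step (2), exploration: waited long enough after a bad observation
        return True
    return False
```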

To show that an optimal policy L_i(λ) is of the form P_i(t), we first take the dual of LPLAGRANGE(λ), and use it in the lemma that follows.

    Minimize  λ + ∑_{i=1}^n h_i        (WHITTLE-DUAL(λ))

subject to:

    λ + t · h_i ≥ r_i − (1 − u_{it}) p_i      ∀i, t ≥ 1
    λ + t · h_i ≥ v_{it} p_i                  ∀i, t ≥ 1
    h_i ≥ 0                                   ∀i

LEMMA 2.4. For any λ ≥ 0, there exists an optimal solution L_i(λ) = {x^i_{gt}, x^i_{bt}} to LPLAGRANGE(λ) that corresponds to a policy of the form P_i(t). For this solution, there exists a corresponding optimal solution {h_i, p_i} to WHITTLE-DUAL(λ). For these solutions, for any arm with h_i > 0:

(1) h_i = H_i(λ).
(2) p_i ≥ 0.
(3) There exists t ≥ 1 such that x^i_{bt} > 0. Denote the smallest such t by t_i. We have λ + t_i h_i = v_{it_i} p_i.
(4) x^i_{g1} > 0 and λ + h_i = r_i − β_i p_i.
(5) The optimal single-arm policy for arm i is L_i(λ) = P_i(t_i), where t_i is as defined in part (3).

If h_i = 0 for arm i, then the optimal single-arm policy L_i(λ) never plays arm i.

PROOF. The first part follows from strong duality.[3] The problem LPLAGRANGE(λ), ignoring the constant λ in the objective, separates into n separate LPs, one for each arm. The dual objective for arm i is precisely h_i, which must be the same as the primal objective, H_i(λ).

If h_i = H_i(λ) > 0, the solution to the LP for arm i is the policy L_i(λ). As noted above, in order to have nonzero H_i(λ), the policy L_i(λ) must play the arm first in some state (b, t_i) and some state (g, t′_i). Since x^i_{st} is the probability this policy plays in state (s, t), this implies x^i_{bt} = 0 for t < t_i, x^i_{bt_i} > 0, and x^i_{gt′_i} > 0.

Since x^i_{bt_i} > 0, by complementary slackness, we have λ + t_i h_i = v_{it_i} p_i. Since the left-hand side is at least zero, this implies p_i ≥ 0. This proves parts (2) and (3).

To see part (4), observe that for the set of constraints λ + t h_i ≥ r_i − (1 − u_{it}) p_i, since 1 − u_{it} is a monotonically increasing function of t, the right-hand side is monotonically decreasing in t. Since the left-hand side is monotonically increasing, if the left-hand side and right-hand side are equal, they have to be so for t = 1. Now, since x^i_{gt′_i} > 0, by complementary slackness, λ + t′_i h_i = r_i − (1 − u_{it′_i}) p_i. By the above argument, t′_i = 1, which completes the proof of part (4).

Since x^i_{g1} > 0 and x^i_{bt_i} > 0, and since we assumed L_i(λ) makes deterministic decisions at each state, this policy plays the arm in state (g, 1) and in state (b, t_i), which is precisely the description of P_i(t_i). This proves part (5). Finally, if h_i = 0, then H_i(λ) = 0, so that an optimal policy L_i(λ) is to never play the arm.

[3] WHITTLE-DUAL(λ) has infinitely many constraints; however, for any λ ≥ 0, it is trivial to find feasible points in WHITTLE-DUAL(λ) and LPLAGRANGE(λ) where all inequalities are strict. Therefore, the Slater conditions are satisfied, implying strong duality and complementary slackness (see, e.g., Faigle et al. [2002]).

Definition 3. For arm i with H_i(λ) > 0, let t_i(λ) = argmin_{t≥1} {L_i(λ) = P_i(t)}.

In order to ease our analysis and notation, for the remainder of this section and in Section 3, we will assume the optimal solution to LPLAGRANGE(λ) is P_i(t_i(λ)), and the optimal solution to WHITTLE-DUAL(λ) is the corresponding dual solution. Note that the constraint λ + t h_i ≥ v_{it} p_i in WHITTLE-DUAL(λ) is tight at t = t_i(λ) by Lemma 2.4, part (3).

It will be instructive to interpret the problem LPLAGRANGE(λ) restricted to arm i as follows. Amortize the reward so that for each play, the arm i yields a steady reward of λ. The goal is to find the single-arm policy that optimizes the excess reward per step over and above the amortized reward λ per play. As we have shown above, the optimal value for this problem is precisely H_i(λ), and the policy L_i(λ) is of the form P_i(t_i(λ)).

2.2.2. Solving LPLAGRANGE(λ). After having decomposed LPLAGRANGE(λ) into independent maximization problems for each arm, and having characterized the optimal single-arm policies, we can now solve the program in polynomial time. It will turn out that this can be solved by simple function maximization via closed-form expressions. Let F_i(λ, t) denote the expected reward minus penalty of the policy P_i(t) when the penalty per play is λ. We have:

LEMMA 2.5 (PROVED IN APPENDIX A.2). For each arm i, the reward minus penalty of the optimal single-arm policy L_i(λ) is

    H_i(λ) = max_{t≥1} F_i(λ, t) = max_{t≥1} ( ((r_i − λ) v_{it} − λ β_i) / (v_{it} + t β_i) )

Since L_i(λ) = P_i(t_i(λ)), it follows that t_i(λ) is the smallest value of t at which the maximum in the above expression is attained. We have that t_i(λ) satisfies the following:

(1) If λ ≥ r_i · α_i / (α_i + β_i(α_i + β_i)), then t_i(λ) = ∞ and H_i(λ) = 0.
(2) If λ = r_i · α_i / (α_i + β_i(α_i + β_i)) − ρ for some ρ > 0, then t_i(λ) (and hence H_i(λ)) can be computed in time polynomial in the input size and in log(1/ρ) by binary search.
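A small sketch of Lemma 2.5 in code (the names are ours): it evaluates the closed form F_i(λ, t) and finds H_i(λ) and t_i(λ) by direct search over t up to a cap, rather than by the binary search of part (2); a returned t of None stands for t_i(λ) = ∞.

```python
def F(r, alpha, beta, lam, t):
    """F_i(lam, t): excess reward per step of the single-arm policy P_i(t) at penalty lam."""
    v_t = alpha / (alpha + beta) * (1.0 - (1.0 - alpha - beta) ** t)
    return ((r - lam) * v_t - lam * beta) / (v_t + t * beta)

def H_and_t(r, alpha, beta, lam, t_max=10_000):
    """Return (H_i(lam), t_i(lam)) by brute-force search over t <= t_max."""
    best_val, best_t = 0.0, None          # never playing (t_i = infinity) has value 0
    for t in range(1, t_max + 1):
        val = F(r, alpha, beta, lam, t)
        if val > best_val:
            best_val, best_t = val, t     # strict '>' keeps the smallest maximizer
    return best_val, best_t
```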

2.3. THE BALANCEDINDEX POLICY. Though we could now use LPLAGRANGE(λ) to solve Whittle's LP by an appropriate choice of λ (refer to Appendix A.3 for details), our 2-approximation policy will not be based on this approach. For our analysis to work, we must make a subtle but crucial modification. We will set λ to be the sum of the excess rewards of all single-arm policies, G(λ) = ∑_{i=1}^n H_i(λ). Note that, by Lemma 2.6 below, this implies λ ≥ OPT/2 and G(λ) ≥ OPT/2. Intuitively, we are forcing the Lagrangian to balance short-term reward (represented by λ) with long-term average reward (represented by G(λ)). Our balance technique can be generalized to many other restless bandit problems (see Sections 4–7).

We first show how to compute this value of λ in polynomial time by presenting the connection between G(λ) and OPT, the value of the optimal solution to (WHITTLE).

LEMMA 2.6.

(1) For any λ ≥ 0, we have λ + G(λ) ≥ OPT.
(2) H_i(λ) is a nonincreasing function of λ.
(3) In polynomial time, we can find a λ such that λ ≥ (1 − ε)OPT/2 and G(λ) ≥ OPT/2.

PROOF. Recall from Lemma 2.4, part (1), that h_i = H_i(λ). Therefore, λ + G(λ) = λ + ∑_{i=1}^n h_i. The latter is the objective of the dual of (WHITTLE), which implies that for any λ, we have λ + G(λ) ≥ OPT.

We will next show that h_i = H_i(λ) is a nonincreasing function of λ. For any λ, consider the value F_i(λ, t) = ((r_i − λ) v_{it} − λ β_i) / (v_{it} + t β_i) of the policy P_i(t). (This expression is proved in Lemma 2.5 and Lemma A.1.) For fixed t, this decreases as λ increases. Therefore, H_i(λ) = max_t F_i(λ, t) is also a nonincreasing function of λ.

Finally, since G(λ) is monotonically nonincreasing in λ, we perform binary search to find the required value of λ. Start with λ = ∑_{i=1}^n r_i, and scale λ down by a factor of (1 + ε) until λ < G(λ). Note that for any λ, the value of G(λ) can be computed in polynomial time by Lemma 2.5. At this point, let λ′ = λ(1 + ε). Since G(λ′) ≤ λ′ and λ′ + G(λ′) ≥ OPT, we have λ′ ≥ OPT/2, which implies λ ≥ (1 − ε)OPT/2. Further, since λ < G(λ) and λ + G(λ) ≥ OPT, we have G(λ) ≥ OPT/2.
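A minimal sketch of the search in this proof (it reuses the H_and_t sketch from Lemma 2.5; the function name and the eps/t_max parameters are ours): scale λ down geometrically from ∑_i r_i until λ < G(λ).

```python
def balanced_lambda(arms, eps=0.01, t_max=10_000):
    """Find the 'balanced' Lagrange multiplier of Lemma 2.6, part (3).
    `arms` is a list of (r, alpha, beta) triples."""
    def G(lam):
        return sum(H_and_t(r, a, b, lam, t_max)[0] for (r, a, b) in arms)
    lam = sum(r for (r, _, _) in arms)
    while lam >= G(lam) and lam > 1e-12:   # tiny floor only guards against numerical issues
        lam /= (1.0 + eps)
    return lam
```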

FIG. 3. The BALANCEDINDEX Policy for FEEDBACK MAB.

2.3.1. Index Policy. We start with the value of λ from Lemma 2.6. Let S = {i : H_i(λ) > 0}; the policy only uses arms in S. For this λ, the solution to LPLAGRANGE(λ) yields one policy P_i(t_i(λ)) of value H_i(λ) for each arm i ∈ S (see Lemma 2.4 and Definition 3). To ease notation, let t_i = t_i(λ). Recall that if an arm was last observed in state s ∈ {g, b} some t ≥ 1 steps ago, then its state is denoted (s, t). We call an arm i in state (g, 1) good, an arm in state (b, t) for t ≥ t_i ready, and an arm in state (b, t) for t < t_i bad. The policy is shown in Figure 3.

Note that the way the scheme works, at most one arm can be in state (g, 1) at any time step, and if such an arm exists, this arm is played at the current step (and in the future until it switches out of this state). This can be thought of as executing the policies P_i(t_i) for arms i ∈ S independently and, in case of simultaneous attempts to play, resolving conflicts according to the above priority scheme.

Though this policy is not written as an index policy, it is equivalent to the following index: There is a dummy arm with index 0 that does not yield reward on playing. If h_i = H_i(λ) = 0, the index for all states of this arm is −1. For arms with h_i > 0, the index for state (g, 1) is 2; that for states (b, t) with t ≥ t_i is 1; and that for states (b, t) with t < t_i is −1. Though this index is not very natural, it has the advantage of only using three priority classes, and being indifferent to different states with the same priority. This implies that we can use it as a framework to analyze more complex and realistic index policies, by showing that these policies essentially break ties within each priority class. In particular, we will present the first analysis of the well-known Whittle index using this idea in Section 3.
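The three-priority index just described is straightforward to write down; the sketch below (function names ours) returns the arm to play at the current step, or None when the dummy arm "wins", i.e., when no arm is good or ready. Since Figure 3 is not reproduced here, this follows the index description above rather than the figure.

```python
def balanced_index(state, t_wait, in_S):
    """Index of one arm: 2 if good, 1 if ready, -1 otherwise (also -1 for arms outside S)."""
    if not in_S:
        return -1
    s, t = state
    if s == 'g' and t == 1:
        return 2
    if s == 'b' and t >= t_wait:
        return 1
    return -1

def choose_arm(states, t_wait, in_S):
    """Play the arm with the highest index; the dummy arm (index 0) means 'play nothing'."""
    best_arm, best_val = None, 0
    for i, st in enumerate(states):
        val = balanced_index(st, t_wait[i], in_S[i])
        if val > best_val:
            best_arm, best_val = i, val
    return best_arm
```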

2.3.2. Analysis. We now prove that the BALANCEDINDEX policy is in fact a (2 + ε)-approximation. The proof is based on the fact that the Lagrange multiplier λ and the excess rewards hi = Hi(λ) give us a way of accounting for the average reward. And by Lemma 2.6, λ ≥ (1 − ε)OPT/2 and ∑i hi ≥ OPT/2, which gives us a way of linking the rewards from our policy to the LP optimum.

THEOREM 2.7. The BALANCEDINDEX policy is a (2 + ε)-approximation to FEEDBACK MAB. Furthermore, this policy can be computed in polynomial time.

PROOF. Since the expected per-step value Hi(λ) is the expected per-step reward of policy Pi(ti) minus the expected per-step penalty λ per play, we can rearrange the statement to claim that the expected reward of Pi(ti) can be accounted as Hi(λ) = hi per step plus λ per play. We use this amortization of rewards to show that the average reward of our index policy is at least (1 − ε)OPT/2.


Focus on any arm i. We call a step blocked for the arm if the arm is ready for play (the state is (b, t) where t ≥ ti) but some other arm is played at the current step. Consider only the time steps which are not blocked for arm i. For these time steps, the arm behaves as follows: It is continuously played in state (g, 1). Then it transitions to state (b, 1) and moves in ti − 1 time steps to state (b, ti − 1). After this the arm might be blocked, and the next state that is not blocked is (b, t) for some t ≥ ti, at which point the arm is played. This implies that if we consider only nonblocked steps, the behavior of the policy is identical to Pi(ti), except that the transition probability from (b, ti) to (g, 1) is vit ≥ viti (since t ≥ ti). Let Ri(t) denote the expected per-step reward of this policy. Using the formula for Ri(t) from Lemma A.1, we have

Ri(t) ≥ ri vit/(vit + ti βi) ≥ ri viti/(viti + ti βi) = Ri(ti),

which implies that the per-step reward of this single-arm policy for arm i, restricted to the nonblocked time steps, is at least the per-step reward Ri(ti) of the optimal single-arm policy Pi(ti). Therefore, for these nonblocked steps, the reward we get is at least hi = Hi(λ) per step, and at least λ per play.

Now, on steps where no arm is played, none of the arms is blocked by definition, so our amortization yields a per-step reward of at least ∑i∈S hi ≥ OPT/2. On steps when some arm is played, the arm that is played by definition cannot be blocked, so we get a reward of at least λ ≥ (1 − ε)OPT/2 for this step. This completes the proof.

2.3.3. Alternate Analysis. The previous analysis is very intuitive. We now present an alternative way to analyze the policy that leads to a more generalizable technique. This uses a Lyapunov (potential) function argument. The analysis in this section assumes the λ from Lemma 2.6, part (3). Recall from Lemma 2.4 that hi = Hi(λ); further define ti = ti(λ). Define the potential φi for each arm i at any time as follows:

Definition 4. If arm i moved to state b some y steps ago (y ≥ 1), the potential φi is hi(min(y, ti) − 1). In state g the potential is pi. Recall that pi is the optimal dual variable in WHITTLE-DUAL(λ).
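The potential of Definition 4 is easy to compute; the helper below is a small illustrative sketch (the argument names h_i, t_i, and p_i stand for hi, ti, and the dual variable pi of arm i).

def potential(state, y, h_i, t_i, p_i):
    # State g: the potential is the dual variable p_i.
    if state == 'g':
        return p_i
    # State b, entered y >= 1 steps ago: the potential is h_i * (min(y, t_i) - 1).
    return h_i * (min(y, t_i) - 1)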

Let ΦT denote the total potential, ∑i φi, at any step T, and let RT denote the total reward accrued until that step. Let ΔRT = RT+1 − RT and ΔΦT = ΦT+1 − ΦT.

LEMMA 2.8. At any time step T, we have:

E[ΔRT + ΔΦT | RT, ΦT] ≥ (1 − ε)OPT/2.

PROOF. At a given step, suppose the policy does nothing; then all arms are "not ready." Then, the potential φi of each arm i ∈ S increases by hi this step. The total increase in potential is precisely

ΔΦT = ∑i∈S hi = G(λ) ≥ OPT/2

On the other hand, suppose that the policy plays arm i, which has last been observed in state b and has been in that state for y ≥ ti steps. With probability q ≥ viti the observed state is g, the change in reward is ΔRT = ri, and the change in potential is pi − hi(ti − 1). With probability 1 − q the observed state is b and the change in potential is −hi(ti − 1) (and there is no change in reward). Thus, in this case, since q ≥ viti and pi ≥ 0, we have:

E[ΔRT + ΔΦT | RT, ΦT] ≥ q pi − hi(ti − 1) ≥ viti pi − hi(ti − 1) ≥ λ + hi ≥ (1 − ε)OPT/2.

The penultimate inequality follows from Lemma 2.4, part (3). Note that the potentials of arms not played cannot decrease, so that the first inequality is valid.

Finally, suppose the policy plays an arm i that was last observed in state g and played in the last step. With probability 1 − βi the increase in reward is ri and the potential is unchanged. With probability βi the potential changes by −pi. Therefore, in this case, by Lemma 2.4, part (4),

E[ΔRT + ΔΦT | RT, ΦT] ≥ ri − βi pi ≥ λ + hi ≥ (1 − ε)OPT/2

By their definition, the potentials ΦT are bounded independently of the time horizon, so by a telescoping summation the above lemma implies that lim_{T→∞} E[RT]/T ≥ (1 − ε)OPT/2. This proves Theorem 2.7.

Gap of Whittle's LP. The following theorem shows that our analysis is almost tight (considering that our 2-approximation is against Whittle's LP).

THEOREM 2.9 (PROVED IN APPENDIX A). The gap of Whittle's LP is arbitrarily close to e/(e − 1) ≈ 1.58.

3. Analyzing the Whittle Index for FEEDBACK MAB

Before generalizing our 2-approximation algorithm to a larger subclass of restless bandit problems, we explore the connection between our analysis and the well-known Whittle index used in practice. This section can be skipped without losing continuity of this article.

A well-studied index policy for restless bandit problems is the Whittle index [Whittle 1988]. In the context of FEEDBACK MAB, this index has been independently studied by Ny et al. [2008] and subsequently by Liu and Zhao [2008]. Both these works give closed-form expressions for this index and show near-optimal empirical performance. Our main result in this section is to justify the empirical performance by showing that a simple but very natural modification of this index, in order to favor myopic exploitation, yields a (2 + ε)-approximation. The modification simply involves giving additional priority to arms in state (g, 1) if their myopic expected next-step reward ri(1 − βi) is at least a threshold value.

The key idea in the analysis is to map the actions of the modified WHITTLE index to actions of the BALANCEDINDEX policy. The exact mapping is presented in Section 3.4, after we present the index itself and analyze it.

3.1. DESCRIPTION OF THE WHITTLE INDEX. The Whittle index is defined for each arm, and for each state of the arm. For state x of arm i, it is the largest penalty-per-play λ such that the optimal policy is indifferent between playing in x and not playing. In our specific problem, the index is formally defined as follows:

Definition 5 (WHITTLE INDEX for Arm i). Suppose the current state of arm i is (s, t), meaning that the arm was last seen to be s ∈ {g, b} (good or bad) t steps ago. The Whittle index Λi(s, t) is a nonnegative real number computed as follows. Using the notation from Section 2.2, for any penalty per play λ, there is a single-arm policy Li(λ) that maximizes the average reward minus penalty (excess reward) Hi(λ) over the infinite horizon. Such a policy maps state (s, t) to an action/decision, either "play" or "not play." When λ = ∞, the optimal action is not to play, and when λ = 0, the optimal action is to play. As λ is decreased from ∞, at some value λ∗, the decision in state (s, t) changes from "not play" to "play." The WHITTLE index Λi(s, t) is precisely this value of λ∗.

The Whittle index policy always plays the arm whose current state has the highest Whittle index (Figure 4).

FIG. 4. The Whittle Index Policy.

Remarks. The Whittle index is strongly decomposable, that is, it can be computed separately for each arm. Further, we have defined λ as a penalty (or amortized reward) per play, while Whittle defines it as a reward for not playing (which he terms the subsidy for passivity); it is easy to see that both these formulations are equivalent. Finally, for FEEDBACK MAB, it can be shown [Ny et al. 2008; Liu and Zhao 2008] that for any state (s, t), there is a unique λ ∈ (−∞, ∞) where the decision switches between "play" and "not play," that is, the decision is monotone in λ. Strictly speaking, the Whittle index is defined only for such systems (termed indexable by Whittle [1988]); we will define this aspect away by insisting that the index λ∗ is the largest value where a switch happens.
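Because the decision is monotone in λ for FEEDBACK MAB, the index of any state can also be computed numerically by bisection on the penalty, given an oracle for the optimal single-arm policy. The Python sketch below is illustrative only: the oracle plays(lam, state), which should return True iff Li(lam) plays in the given state, and the upper bound lam_hi are assumptions; the works cited above give closed forms instead.

def whittle_index(plays, state, lam_hi, iters=60):
    # At lam = 0 the optimal single-arm policy plays; at lam = lam_hi it does not.
    lo, hi = 0.0, float(lam_hi)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if plays(mid, state):
            lo = mid      # the policy still plays: the index is at least mid
        else:
            hi = mid      # the policy no longer plays: the index is below mid
    return lo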

We present an explicit connection of the Whittle index to LPLAGRANGE(λ).

LEMMA 3.1 (PROVED IN APPENDIX A). Recall the notation Li(λ) and Pi(t) from Section 2.2. The following hold for Λi(s, t):

(1) Λi(s, t) ≥ 0 for all states (s, t), where s ∈ {g, b} and t ∈ Z+.
(2) Λi(g, 1) = ri(1 − βi), and Λi(b, t) ≤ Λi(g, 1) for all t ≥ 1.
(3) Λi(b, t) = max{λ | Li(λ) = Pi(t)}, and is a monotonically nondecreasing function of t.

Though the Whittle index is widely used, it is not clear how to analyze it, since it leads to complicated priorities between arms. We now show that our balancing technique also implies an analysis for a slight but nontrivial modification of the Whittle index. We point out the exact connection between the modified Whittle index policy and the BALANCEDINDEX policy in Section 3.4.

3.2. THE THRESHOLD-WHITTLE POLICY. We now show that modifying the Whittle index slightly to exploit the myopic next-step reward in good states g yields a 2-approximation. Note that the myopic next-step reward of an arm i in state g is precisely ri(1 − βi), which is the same as Λi(g, 1) by Lemma 3.1. To see this, if the arm is played in state (g, 1), the state stays (g, 1) with probability (1 − βi), and in this case, the reward obtained is ri. Our modification to the Whittle index essentially favors exploiting such a "good" state if the myopic reward is at least a certain threshold value. In particular, we analyze the policy THRESHOLD-WHITTLE(λ) shown in Figure 5, where we set λ = λ∗, the value satisfying λ∗ = G(λ∗) (refer to Section 2.3).


FIG. 5. THRESHOLD-WHITTLE(λ). It exploits arm i if the myopic reward in state (g, 1) is ≥ λ.

Note that the above policy can be restated as playing the arm with the highest modified index, which is computed as follows: For arm i, if Λi(g, 1) = ri(1 − βi) ≥ λ, the modified index for state (g, 1) is ∞; else the modified index is the same as the Whittle index.
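As a concrete illustration of this restatement, the sketch below computes the modified index and picks the arm to play. Here whittle(i, state) is an assumed oracle returning Λi(s, t), and the dictionaries r and beta hold ri and βi; these interfaces are assumptions for illustration.

import math

def modified_index(i, state, whittle, r, beta, lam_star):
    s, t = state
    # Favor myopic exploitation: state (g, 1) gets index infinity when the
    # myopic reward r_i (1 - beta_i) is at least the threshold lambda*.
    if s == 'g' and t == 1 and r[i] * (1.0 - beta[i]) >= lam_star:
        return math.inf
    return whittle(i, state)

def threshold_whittle_step(states, whittle, r, beta, lam_star):
    # Play the arm whose current state has the highest modified index.
    return max(states, key=lambda i: modified_index(i, states[i], whittle, r, beta, lam_star))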

THEOREM 3.2. THRESHOLD-WHITTLE(λ∗) is a (2 + ε)-approximation for FEEDBACK MAB. Here, λ∗ is the value computed in the statement of Lemma 2.6.

3.3. PROOF OF THEOREM 3.2. To ease notation, we will assume that λ∗ in the statement of Lemma 2.6 satisfies λ∗ = ∑i hi ≥ OPT/2, and we will show that THRESHOLD-WHITTLE(λ∗) is a 2-approximation to OPT. We will prove this result by modifying our analysis of the BALANCEDINDEX policy (from Figure 3). Recall that S is the set of arms with hi > 0 in the optimal solution to WHITTLE-DUAL(λ∗). For such arms, recall from Lemma 2.4 and Def. 3 that ti(λ∗) = argmin_{t ≥ 1}{Li(λ∗) = Pi(t)}. Let ti = ti(λ∗). By Lemma 2.4, part (3), we have λ∗ + ti hi = pi viti for arms i ∈ S. Recall that for arm i ∈ S, state (g, 1) is termed good; state (b, t) is termed ready if t ≥ ti; and state (b, t) is bad if t < ti. The index policy from Figure 3 favors good over ready states, and does not play any arm in bad states.

CLAIM 3.3. For any arm i, exactly one of the following is true in the optimal solutions to WHITTLE-DUAL(λ∗) and LPLAGRANGE(λ∗).

(1) If hi > 0, then Λi(b, ti − 1) < λ∗, Λi(b, ti) ≥ λ∗, and Λi(g, 1) = ri(1 − βi) ≥ λ∗.
(2) If hi = 0, then Λi(b, t) < λ∗ for all t ≥ 1.

PROOF. At penalty λ = 0, the optimal solution to the Lagrangian has value ∑i Hi(0) = ∑i ri αi/(αi + βi). This implies λ∗ > 0. Recall from Lemma 2.4 that the optimal solution to LPLAGRANGE(λ∗) finds the policy Pi(ti) for every arm i with hi > 0, and that arms with hi = 0 are not played at all.

For the arms with hi > 0, by Lemma 2.4, the optimal single-arm policy is Li(λ∗) = Pi(ti). Therefore, using the relation from Lemma 3.1, part (3), we immediately have Λi(b, ti) = max{λ : Li(λ) = Pi(ti)} ≥ λ∗. Suppose the optimal policy for penalty λ is Pi(ti(λ)). Then, by Lemma A.4, ti(λ) is monotonically nondecreasing in λ. Since Λi(b, t) = max{λ : Li(λ) = Pi(t)} from Lemma 3.1, part (3), we must have Λi(b, t) < λ∗ for all t < ti. Finally, Λi(g, 1) = ri(1 − βi) ≥ Λi(b, ti) ≥ λ∗ follows from Lemma 3.1.

Next, suppose hi = 0. By Lemma 2.4, the policy Li(λ∗) never plays arm i, so that by Lemma 3.1, part (3), Λi(b, t) < λ∗ for all t ≥ 1.

3.3.1. Types of Arms. As in the BALANCEDINDEX policy (Figure 3), let S = {i : hi > 0}. These are the only arms that BALANCEDINDEX plays. We analyze THRESHOLD-WHITTLE(λ∗) separately for these arms and for the remaining arms.


Arms in S. We consider the behavior of THRESHOLD-WHITTLE(λ∗) restricted to just these arms. We have the following for the policy of Figure 3:

(1) By Claim 3.3, part (1), if the arm is ready (state = (b, t) for t ≥ ti), we have Λi(b, ti) ≥ λ∗.
(2) If the arm is bad (state = (b, t) for t < ti), then using Claim 3.3, part (1), we have Λi(b, t) < λ∗.
(3) If the arm is good (state = (g, 1)), then using Lemma 3.1, part (2), we have Λi(g, 1) ≥ Λi(b, ti) ≥ λ∗, so that the modified Whittle index is infinity.

Therefore, THRESHOLD-WHITTLE(λ∗) executed only over arms in S also gives priority to good over ready over bad arms. The key difference with the policy in Figure 3 is that instead of idling when all arms are bad, the policy THRESHOLD-WHITTLE(λ∗) will play some bad arm. We now show that this is better than idling.4

4 Note that the claim crucially uses the concave nature of the vit function, and need not be true for more general problems, for instance, the MONOTONE bandit problem (Section 4).

CLAIM 3.4. THRESHOLD-WHITTLE(λ∗) executed just over arms in S yields reward at least OPT/2.

PROOF. Consider the alternate analysis presented in Section 2.3.3. The INDEX policy from Figure 3 does not play an arm i in a bad state, and achieves a change in potential Δφi of exactly hi. All we need to show is that if the arm is played instead, the expected change in potential is still at least hi. The rest of the proof is the same as that of Lemma 2.8. Suppose the arm is played at time T after idling t ≥ 1 steps. The expected change in potential is E[Δφi] = vit pi − hi(t − 1). We further have, by the definition of ti, that λ + ti hi = viti pi. We therefore have pi viti ≥ ti hi. Since vit is a concave function of t with vi0 = 0, the above implies that for every t ≤ ti, we must have pi vit ≥ t hi. Therefore, E[Δφi] = vit pi − hi(t − 1) ≥ t hi − hi(t − 1) = hi.

Arms not in S. The only catch now is that THRESHOLD-WHITTLE(λ∗) can sometimes play an arm not in S, whose hi = 0. For such arms, we count their reward and ignore the change in potential. We define such an arm j ∉ S as nice if Λj(g, 1) = rj(1 − βj) ≥ λ∗, so that the modified Whittle index in state (g, 1) is ∞.

LEMMA 3.5. In THRESHOLD-WHITTLE(λ∗), if arm j ∉ S preempts the play of an arm i ∈ S, then either the reward from playing j is at least λ∗, or the increase in potential of arm i, Δφi, is at least hi.

PROOF. Suppose j ∉ S is nice. When j was last observed to be in state (g, 1), the arm is played continuously even if arms in S become ready. However, for every time step in which such an event happens, the current expected reward of playing this arm is precisely rj(1 − βj), which is at least λ∗. When arm j was last observed to be in state b, since hj = 0, by Claim 3.3, part (2), we have Λj(b, t) < λ∗ for all finite t. This implies this arm can only preempt arms i ∈ S if all such arms are "bad." But in this case, arm i is in state (b, t) for t < ti, so that the increase in potential of arm i is precisely Δφi = hi, using the proof argument in Section 2.3.3.

Now suppose j ∉ S is not nice. Then, by Lemma 3.1, part (2), we have Λj(s, t) < λ∗ for all s ∈ {b, g} and t ≥ 1. This again implies that such an arm in any state can only preempt arms i ∈ S that are bad; in that case, the potential φi of arm i increases by hi. This completes the proof.

3.3.2. Completing the Proof of Theorem 3.2. There are two cases. First, if a nice arm j ∉ S is played, the expected reward plus change in potential is at least min{λ∗, ∑i∈S hi} ≥ OPT/2, by the proof of Lemma 3.5. If a ready or good arm i ∈ S is played, then the proof of Claim 3.4 implies that the reward plus change in potential of this arm is again at least OPT/2. In the remaining case, all arms i ∈ S are bad, and focusing on just these arms, the increase in potential of each arm i is at least hi, so that the total reward plus change in potential of the system is at least ∑i hi ≥ OPT/2. This completes the proof, and shows that THRESHOLD-WHITTLE(λ∗) is a 2-approximation. We note that the above analysis extends easily to the variant where M ≥ 1 arms are simultaneously played per step.

3.4. RELATING THRESHOLDWHITTLE TO BALANCEDINDEX. In order to compare the THRESHOLDWHITTLE and BALANCEDINDEX policies, it is useful to split the arms and states for each arm into the following groups (all of which are defined relative to the optimal Lagrangian policies Li(λ∗) = Pi(ti)).

(1) hi > 0; state = (g, 1).
(2) hi > 0; state = (b, t) and t ≥ ti.
(3) hi = 0; state = (g, 1) and Λi(g, 1) ≥ λ∗.
(4) hi > 0; state = (b, t) and t < ti.
(5) Remaining arms and states.

The BALANCEDINDEX policy gives priority to any arm in group (1) and then to any arm in group (2), breaking ties arbitrarily in the latter group. It ignores arms in groups (3), (4), and (5). The THRESHOLDWHITTLE policy gives priority to arms in groups (1) and (3), since their modified Whittle index is ∞; then to arms in group (2) based on their Whittle index, since these arms have Whittle index at least λ∗; and finally to arms in groups (4) and (5) depending on their Whittle index (which is at most λ∗). This follows from the proofs in Sections 2.3 and 3.3. Note that unlike the BALANCEDINDEX policy, which can idle when all arms are in groups (3), (4), or (5), the THRESHOLDWHITTLE policy never idles.
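The grouping above is mechanical; the following illustrative Python helper assigns an (arm, state) pair to its group, assuming dictionaries h, t_thresh, r, and beta holding hi, ti, ri, βi, and the threshold lam_star = λ∗. It is a sketch under these assumptions, not code from the paper.

def group(i, state, h, t_thresh, r, beta, lam_star):
    s, t = state
    if h[i] > 0:
        if s == 'g' and t == 1:
            return 1            # exploited first by both policies
        if s == 'b' and t >= t_thresh[i]:
            return 2            # "ready"
        if s == 'b':
            return 4            # "bad", not yet ready
    elif s == 'g' and t == 1 and r[i] * (1.0 - beta[i]) >= lam_star:
        return 3                # a nice arm outside S in its good state
    return 5                    # everything else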

4. The General Technique: MONOTONE Bandits

In this section, we present a general and nontrivial subclass of restless bandits for which a generalization of the above balancing technique yields a 2-approximate index policy. We term this class MONOTONE bandits; it captures both the stochastic MAB and the FEEDBACK MAB as special cases.

In MONOTONE bandits, there are n bandit arms. Each arm i can be in one of Ki states denoted Si = {σ^i_1, σ^i_2, ..., σ^i_Ki}. When the arm is not played, its state remains the same and it does not fetch reward. Suppose the arm is in state σ^i_k and is played next after t ≥ 1 steps. Then, it gains reward r^i_k ≥ 0, and transitions to one of the states σ^i_j ≠ σ^i_k with probability g_i(k, j, t), and with the remaining probability stays in state σ^i_k. (For notational convenience, we denote σ^i_k simply as k; the arm it refers to will be clear from the context.) The new state is observed by the policy. The state transition trajectories for different arms are independent. At most one arm is played per step. The goal is to find a policy for playing the arms so that the infinite horizon time-average reward is maximized.

In addition, we have the following key properties of the transition probabilities:

Separability Property. We assume that g_i(k, j, t) is of the form f^i_k(t) q^i(k, j), where f^i_k(t) is piecewise linear with polynomially many breakpoints (see Def. 6). The function f^i_k(t) ∈ [0, 1] for positive integers t can be thought of as an "escape probability" from the state σ^i_k ∈ Si. Conditioned on the escape, the state changes to σ^i_j ∈ Si with probability q^i(k, j); thus ∑_{j≠k} q^i(k, j) ≤ 1.

Monotone Property. For every arm i and state k ∈ Si, we have f^i_k(t) ≤ f^i_k(t + 1) for every t.

Remarks. Our solution technique achieves polynomial running time for any function f for which Whittle's LP (Section 4.2) can be solved in polynomial time. This is equivalent to insisting that the single-arm Lagrangian that extends LPLAGRANGE(λ) from Section 2 can be solved efficiently, which is true if the Whittle index for each state is efficiently computable. In order to bring out the key ideas as succinctly as possible, and for clarity of exposition, we will simply assume that the functions f are piecewise linear with poly-size specification (see Definition 6). We note that the ideas in Section 2 can be used for more complicated functions; we omit further discussion of this issue.

We will assume, without loss of generality, that for each arm i, the graph whose vertices are k ∈ Si and in which a directed edge (j, k) exists if q^i(j, k) > 0 is strongly connected (but not necessarily ergodic). To see that this assumption is without loss of generality, we can evolve the Markov chain for arm i from the start state until it reaches a strongly connected component, and ignore the remaining states; this does not change the expected infinite horizon average reward of any policy.

We finally note that in showing our upper bound results (Theorem 4.1), the monotonicity and separability properties are equally important. Our terminology MONOTONE bandits should ideally be SEPARABLE MONOTONE bandits, but we use the former for brevity.

Results. Our main results can be summarized in the following theorem:

THEOREM 4.1. For the MONOTONE bandits problem, there is a 2-approximation in time polynomial in n, max_i Ki, and the bit complexity of specifying the f, q values. Further, the following lower bounds hold (see Section 4.5): When the monotone property is relaxed, the problem becomes nε-hard to approximate. Finally, if the separability property is not satisfied, then Whittle's LP, on which the upper bound analysis is based, has an Ω(n) gap against the value of the optimal policy.

Motivation and Special Cases. Intuitively, MONOTONE bandits model optimization scenarios in which uncertainty increases: when an arm has just been played and we observe its state, we are most certain that our observation still holds true the next time step. However, the nondecreasing nature of f implies that as time goes on, the "escape probability" increases and the previous observation becomes less and less reliable. This serves as a model for certain POMDPs, such as the FEEDBACK MAB.


FIG. 6. The linear program (WHITTLE).

Observe that MONOTONE bandits generalizes the FEEDBACK MAB. For the states Si = {g, b}, set q^i(g, b) = q^i(b, g) = 1, f^i_g(t) = 1 − uit, and f^i_b(t) = vit. Recall from Fact 2.2 that uit and vit are, respectively, the probabilities of observing the state g when the state last observed t steps ago was g or b, and that 1 − uit and vit are both monotonically increasing. We also note that MONOTONE bandits generalizes the stochastic MAB by setting f^i_k(t) = 1 for all t.

4.1. HIGH LEVEL IDEA. Unlike the FEEDBACK MAB problem, in MONOTONE bandits there is no longer a clear distinction between "good" and "bad" states. Note however that an equivalent way of finding λ such that λ = ∑_{i=1}^n Hi(λ) is to treat λ as a variable and enforce λ = ∑_{i=1}^n hi as a constraint in the dual of Whittle's LP. By taking this approach, the variables p^i_k (now one for each state k ∈ Si) can be interpreted as dual potentials, and the dual constraints are in terms of the expected potential change of playing in state k ∈ Si. Based on the sign of this potential change, we can classify the states into "good" and "bad" via complementary slackness. Our index policy continuously exploits arms in "good" states, and waits until the dual constraint goes tight (i.e., the arm becomes "ready") before playing in "bad" states. We formalize the previous potential-based argument using a Lyapunov function and show a 2-approximation. We note that the LP-duality approach is entirely equivalent to the Lagrangian approach; however, it leads to a different interpretation of variables which is more generalizable.

4.2. WHITTLE'S LP AND ITS DUAL. As with FEEDBACK MAB, for each arm i and k ∈ Si, we have variables {x^i_{kt}, t ≥ 1}. These variables capture the probabilities (in the execution of the optimal policy) of the event: Arm i is in state k, was last played t steps ago, and is played at the current step. These quantities are well defined for ergodic policies. Whittle's LP is presented in Figure 6. Let its optimal value be denoted OPT. The LP effectively encodes the constraints on the evolution of the state of each arm separately, connecting them only by the constraint that at most one arm is played in expectation every step. The first constraint simply states that at any step, at most one arm is played; the second constraint encodes that each arm can be in at most one possible state at any time step; and the final constraint encodes that the rate of entering state k ∈ Si is the same as the rate of exiting this state. This LP is clearly a relaxation of the optimal policy; the details are the same as in the proof of Lemma 2.3.

This LP has infinite size, and we will fix that aspect in this section. In particular, we now show that the LP has polynomial size when the f^i_k are piecewise linear with poly-size specification.


FIG. 7. The polynomial-size dual of Whittle's LP, which we denote (WHITTLE-DUAL). The quantity ΔP^i_k is defined in Eq. (1).

Definition 6. Given i and k ∈ Si, f^i_k(t) is specified as the piecewise linear function that passes through breakpoints (t_1 = 1, f^i_k(1)), (t_2, f^i_k(t_2)), ..., (t_m, f^i_k(t_m)). Denote the set {t_1, t_2, ..., t_m} as W^i_k. For two consecutive points t_1, t_2 ∈ W^i_k with t_1 < t_2, the function f^i_k is specified at t_1 and t_2. For t ∈ (t_1, t_2), we have f^i_k(t) = ((t_2 − t) f^i_k(t_1) + (t − t_1) f^i_k(t_2))/(t_2 − t_1). For t ≥ t_m, we have f^i_k(t) = f^i_k(t_m).
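For concreteness, a piecewise-linear f as in Definition 6 can be evaluated as follows; the sketch assumes the breakpoints are given as a sorted list of (t, f(t)) pairs with t_1 = 1, and is purely illustrative.

def piecewise_linear_f(breakpoints, t):
    # Constant extension beyond the last breakpoint.
    if t >= breakpoints[-1][0]:
        return breakpoints[-1][1]
    # Linear interpolation between consecutive breakpoints t_a <= t <= t_b.
    for (ta, fa), (tb, fb) in zip(breakpoints, breakpoints[1:]):
        if ta <= t <= tb:
            return ((tb - t) * fa + (t - ta) * fb) / (tb - ta)
    return breakpoints[0][1]    # t below the first breakpoint (t_1 = 1)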

Consider the dual of the above relaxation. The first constraint has multiplier λ, the second set of constraints has multipliers hi, and the final equality constraints have multipliers p^i_k. For notational convenience, define:

ΔP^i_k = ∑_{j∈Si, j≠k} q^i(k, j) (p^i_j − p^i_k)    (1)

Note that ΔP^i_k is a variable that depends on the dual variables p^i_∗. We obtain the following dual.

Minimize λ + ∑_{i=1}^n hi

λ + t hi ≥ r^i_k + f^i_k(t) ΔP^i_k    ∀i, k ∈ Si, t ≥ 1
λ, hi ≥ 0    ∀i

Since f^i_k(t) is piecewise linear, for two consecutive breakpoints t_1 < t_2 in W^i_k, the constraint λ + t hi ≥ r^i_k + f^i_k(t) ΔP^i_k holds for all t ∈ [t_1, t_2] iff it holds at t_1 and at t_2. This means that the constraints for t ∉ W^i_k are redundant. Therefore, the above dual is equivalent to the one presented in Figure 7.

Taking the dual of the above program, we finally obtain a polynomial-size relaxation for MONOTONE bandits. Since this poly-size LP only differs from (WHITTLE) in restricting t to lie in the relevant set W^i_k, and since it will not be explicitly needed in the remaining discussion, we omit writing it explicitly.

4.3. THE BALANCED LINEAR PROGRAM. We do not solve Whittle's relaxation. Instead, we solve the modification of (WHITTLE-DUAL) from Figure 7, which we denote (BALANCE). This is shown in Figure 8. The additional constraint in (BALANCE) (as in FEEDBACK MAB) is the constraint λ = ∑_{i=1}^n hi.

FIG. 8. The dual linear program (BALANCE) for MONOTONE MAB.

The primal linear program corresponding to (BALANCE) is the following (where we place an unconstrained multiplier ω on the final constraint of (BALANCE)):

Maximize ∑_{i=1}^n ∑_{k∈Si} ∑_{t∈W^i_k} r^i_k x^i_{kt}    (PRIMAL-BALANCE)

∑_{i=1}^n ∑_{k∈Si} ∑_{t∈W^i_k} x^i_{kt} ≤ 1 − ω
∑_{k∈Si} ∑_{t∈W^i_k} t x^i_{kt} ≤ 1 + ω    ∀i
∑_{j≠k} ∑_{t∈W^i_k} x^i_{kt} q^i(k, j) f^i_k(t) = ∑_{j≠k} ∑_{t∈W^i_j} x^i_{jt} q^i(j, k) f^i_j(t)    ∀i, k
x^i_{kt} ≥ 0    ∀i, k, t

The linear program (BALANCE) is the program that we actually solve. We easily have the following sequence of inequalities:

OPT = WHITTLE = WHITTLE-DUAL ≤ BALANCE = PRIMAL-BALANCE

Here, the inequality follows since (BALANCE) has one additional constraint compared to (WHITTLE-DUAL).

We now show the following properties of the optimal solution to (BALANCE) using complementary slackness conditions between (BALANCE) and (PRIMAL-BALANCE). In the rest of this section, we only deal with the optimal solutions to the above programs, so all variables correspond to the optimal setting. Recall that OPT is the optimal value of (WHITTLE). Since any feasible solution to (BALANCE) is feasible for (WHITTLE-DUAL), we have the following lemma:

LEMMA 4.2. In the optimal solution to (BALANCE), λ = ∑_{i=1}^n hi ≥ OPT/2.

The next lemma is the crux of the analysis, where for any arm being played in any state, we use complementary slackness to explicitly relate the dual variables to the reward obtained. Note that unlike the analyses of primal-dual algorithms, our proof needs to use both the exact primal and the exact dual complementary slackness conditions. This aspect requires us to actually solve the dual optimally.

LEMMA 4.3. Assume OPT > 0. Then, one of the following is true for the optimal solution to (BALANCE): Either there is a trivial 2-approximation by repeatedly playing the same arm; or for every arm i with hi > 0 and for every state k ∈ Si, there exists t ∈ W^i_k such that the following LP constraint is tight with equality.

λ + t hi ≥ r^i_k + f^i_k(t) ΔP^i_k    (2)


Moreover, in the latter case, for any arm i such that hi > 0 and state k ∈ Si, if ΔP^i_k < 0, then:

λ + hi = r^i_k + f^i_k(1) ΔP^i_k

PROOF. In any solution to (PRIMAL-BALANCE), if ω ≤ −1 or ω ≥ 1, then the value of (PRIMAL-BALANCE) is 0, but the optimal value of (PRIMAL-BALANCE) is at least OPT > 0. Thus, in the optimal solution to (PRIMAL-BALANCE), ω ∈ (−1, 1). Also, since OPT > 0, we have ∑i hi ≥ OPT/2 > 0, so that hi > 0 for at least one arm i.

The optimal solutions to (BALANCE) and (PRIMAL-BALANCE) satisfy the following complementary slackness conditions (recall from above that ω > −1, so that 1 + ω > 0):

hi > 0 ⇒ ∑_{k∈Si} ∑_{t∈W^i_k} t x^i_{kt} = 1 + ω > 0    (3)

λ + t hi > r^i_k + f^i_k(t) ΔP^i_k ⇒ x^i_{kt} = 0    (4)

Let us fix an arm i that violates the second condition of the lemma; namely, fix arm i and state k ∈ Si so that hi > 0 and λ + t hi > r^i_k + f^i_k(t) ΔP^i_k for every t ∈ W^i_k. By condition (4), x^i_{kt} = 0 for all t ∈ W^i_k, which trivially implies that x^i_{kt} f^i_k(t) = 0 for all t ∈ W^i_k. Therefore, in the following constraint in (PRIMAL-BALANCE):

∑_{l≠k} ∑_{t∈W^i_k} x^i_{kt} f^i_k(t) q^i(k, l) = ∑_{l≠k} ∑_{t∈W^i_l} x^i_{lt} f^i_l(t) q^i(l, k)

the left-hand side is zero because x^i_{kt} f^i_k(t) = 0, which means the right-hand side is zero. Since all variables are nonnegative, this implies that for any j ∈ Si with q^i(j, k) > 0, we have x^i_{jt} f^i_j(t) = 0 for all t ∈ W^i_j.

Recall (from Section 4) that we assumed the graph on the states, with edges from j to k if q^i(j, k) > 0, is strongly connected. Therefore, by repeating this argument, we get x^i_{jt} f^i_j(t) = 0 for all j and all t ∈ W^i_j.

By condition (3), since hi > 0, there exist j ∈ Si and t ∈ W^i_j such that x^i_{jt} > 0 (or else the sum in condition (3) is zero). However, since x^i_{jt} f^i_j(t) = 0, this implies that f^i_j(t) = 0, which implies that f^i_j(1) = 0 by the MONOTONE property. Since x^i_{jt} > 0, using condition (4) and plugging in f^i_j(t) = 0, we get λ + t hi = r^i_j. Moreover, by plugging f^i_j(1) = 0 into the t = 1 constraint of (BALANCE), we get λ + hi ≥ r^i_j. These two facts imply that λ + hi = r^i_j. This implies that the policy that starts with arm i in state j and always plays this arm obtains per-step reward λ + hi > OPT/2.

We finally show that for any arm i such that hi > 0 and state k ∈ Si, if ΔP^i_k < 0, then λ + hi = r^i_k + f^i_k(1) ΔP^i_k. Observe that Inequality (2) is tight for some t ∈ W^i_k. If it is not tight for t = 1, then since f^i_k(t) is nondecreasing in t and since ΔP^i_k < 0, it will not be tight for any t. Thus, we have a contradiction, proving this part.

In the remaining discussion, we assume that this lemma does not find an arm i that yields reward at least OPT/2. This means that for all i, k, there exists some t ∈ W^i_k that makes Inequality (2) tight.


FIG. 9. The BALANCEDINDEX Policy for MONOTONE MAB.

4.4. THE BALANCEDINDEX POLICY. Start with the optimal solution to (BALANCE). First, throw away the arms for which hi = 0. Let S = {i : hi > 0}. By Lemma 4.2, ∑_{i∈S} hi ≥ OPT/2. Define the following quantities for each of these arms.

Definition 7. For each i ∈ S and state k ∈ Si, let t^i_k be the smallest value of t ∈ W^i_k for which λ + t hi = r^i_k + f^i_k(t) ΔP^i_k in the optimal solution to (BALANCE).

Note that by Lemma 4.3, t^i_k is well defined for every k ∈ Si.

Definition 8. For arm i, partition the states Si into sets Gi and Ii as follows:

(1) k ∈ Gi if ΔP^i_k < 0. (By Lemma 4.3, t^i_k = 1.) These states are termed good.
(2) k ∈ Ii if ΔP^i_k ≥ 0. If arm i has been in state k ∈ Ii for fewer than t^i_k steps, it is defined to be not ready for play. Once it has waited t ≥ t^i_k steps, it becomes ready and can be played.

With this notation, the policy is now presented in Figure 9. In this policy, if arm i moves to a state k ∈ Gi, it is continuously played until it moves to a state in Ii. Initially, assume at most one arm i is in a state k ∈ Gi, so that this invariant holds throughout the algorithm. (The initialization step is easy: Play each arm i in turn till it enters a state not in Gi.)

Intuitively, the states in Gi are the "exploitation" or "good" states. On the contrary, the states in Ii are "exploration" or "bad" states, so the policy waits until it has a high enough probability of exiting these states before playing them. In both cases, t^i_k corresponds to the "recovery time" of the state, which is 1 in a "good" state but could be large in a "bad" state.
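Since Figure 9 is not reproduced, the following Python sketch illustrates one decision step of the policy just described. Here states maps each arm i ∈ S to (k, y), where y is the number of steps since the arm entered state k, and good[i] and t_thresh[i][k] encode Gi and t^i_k; these data structures are illustrative assumptions.

def monotone_balanced_step(states, good, t_thresh, S):
    # An arm sitting in a good state (recovery time 1) is exploited immediately.
    for i in S:
        k, y = states[i]
        if k in good[i]:
            return i
    # Otherwise play any ready arm: in a bad state k for at least t_k^i steps.
    for i in S:
        k, y = states[i]
        if y >= t_thresh[i][k]:
            return i
    return None    # every arm is "not ready": idle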

Lyapunov Function Analysis. We use a Lyapunov (potential) function argument to show that the policy described in Figure 9 is a 2-approximation. Define the potential φi for each arm i at any time as follows. (Recall the definition of t^i_k from Definition 7, as well as the quantities λ, hi from the optimal solution of (BALANCE).)

Definition 9. If arm i moved to state k ∈ Si some y steps ago (y ≥ 1 by definition), the potential φi is p^i_k + hi (min(y, t^i_k) − 1).

Therefore, whenever arm i enters state k, its potential is p^i_k. If k ∈ Ii, the potential then increases at rate hi for t^i_k − 1 steps, after which it remains fixed until the arm is played. Our policy plays arm i only if its current potential is p^i_k + hi (t^i_k − 1).

We finally complete the analysis in the following lemma. The proof crucially uses the "balance" property of the dual, which implies that λ = ∑i hi ≥ OPT/2.


Let ΦT denote the total potential, ∑_{i=1}^n φi, at any step T, and let RT denote the total reward accrued until that step. Define the function LT = T · OPT/2 − RT − ΦT. Let ΔRT = RT+1 − RT and ΔΦT = ΦT+1 − ΦT.

LEMMA 4.4. LT is a Lyapunov function, that is, E[LT+1 | LT] ≤ LT. (It follows that E[LT] ≤ E[L1].) Equivalently, at any step:

E[ΔRT + ΔΦT | RT, ΦT] ≥ OPT/2

i hi ≥ OPT/2.On the other hand, suppose that the policy plays arm i , which is currently in

state k and has been in that state for y ≥ t ik steps. The change in reward �RT = r i

k .Moreover, the current potential of the arm must be i

T = pk + hi (t ik − 1). The new

potential follows the following distribution:

iT +1 =

{pi

j , with probability f ik (y)qi (k, j) ∀ j = k

pik, with probability 1 − ∑

j =k f ik (y)qi (k, j)

Therefore, if arm i is played, the change in potential is:

E[�iT ] = f i

k (y)∑

j∈Si , j =k

(qi (k, j)(pi

j − pik)

) − hi (tik − 1)

From the description of the INDEX policy, y = t ik = 1 if k ∈ Gi . Since at most one

arm can be in such a state, y > t ik only when k ∈ Ii . In that case �Pi

k ≥ 0 byDefinition 8, so that f i

j (y)�Pik ≥ f i

j (t ik)�Pi

k by the MONOTONE property (sincey ≥ t i

k).Therefore, for the arm i being played, regardless of whether k ∈ Gi or k ∈ Ii ,

�RT + E[�iT ] = r i

k + f ik (y)�Pi

k − hi (tik − 1)

≥ r ik + f i

k (t ik)�Pi

k − hi tik + hi

= λ + hi > OPT/2

where the last equality follows from the definition of t ik (Definition 7). Since the

potentials of the arms not being played do not decrease (since all hl > 0), the totalchange in reward plus potential is at least OPT/2. This completes the proof.

By their definition, the potentials ΦT are bounded independently of the time horizon. By telescoping summation, this lemma implies that lim_{T→∞} E[RT]/T ≥ OPT/2. We finally have:

THEOREM 4.5. The BALANCEDINDEX policy is a 2-approximation for MONOTONE bandits.
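As a sanity check of Theorem 4.5, one can simulate the MONOTONE dynamics and compare the empirical time-average reward of a policy against a bound such as OPT/2. The harness below is an illustrative sketch, not code from the paper; the model inputs (per-arm state lists, rewards r[i][k], transition dictionaries q[i][k], escape functions f[i][k], and a policy callback) are assumptions chosen for this example.

import random

def simulate(states, r, q, f, policy, horizon=100000, seed=0):
    rng = random.Random(seed)
    n = len(states)
    state = {i: states[i][0] for i in range(n)}   # arbitrary starting states
    age = {i: 1 for i in range(n)}                # steps since arm i was last played
    total = 0.0
    for _ in range(horizon):
        i = policy(state, age)                    # arm to play this step, or None to idle
        if i is not None:
            k, t = state[i], age[i]
            total += r[i][k]
            u, acc, esc = rng.random(), 0.0, f[i][k](t)
            for j, pj in q[i][k].items():         # move to j != k with prob esc * q^i(k, j)
                acc += esc * pj
                if u < acc:
                    state[i] = j
                    break                         # with the remaining probability the state stays k
            age[i] = 1
        for j in range(n):
            if j != i:
                age[j] += 1
    return total / horizon                        # empirical time-average reward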

4.5. LOWER BOUNDS: NECESSITY OF MONOTONICITY AND SEPARABILITY. We show that MONOTONE bandits is NP-Hard, and that if the MONOTONE property is relaxed even slightly, the problem either has an Ω(n) integrality gap for Whittle's LP, or becomes nε-hard to approximate.

Input Specification. In this discussion, we assumed that the input to the MONOTONE bandits problem is specified by polynomial-size state spaces Si for each arm, the associated matrices q^i, and functions f^i_k(t) that are piecewise linear with poly-size specification. We can model this problem as a restless bandit problem in the sense defined in the literature by replacing each state k ∈ Si with infinitely many states {k_t, t ∈ Z+}; if the arm is not played, it transitions deterministically from state k_t to k_{t+1}, but if played in state k_t, it transitions with probability q^i(k, j) f^i_k(t) to state j_1 for each j ∈ Si, and with the remaining probability transitions to k_1. This reduction uses infinitely many states, and is unlike the typical formulation of restless bandits, which assumes the state space of each arm is poly-bounded. (The PSPACE-Hardness proof for restless bandits [Papadimitriou and Tsitsiklis 1999] assumes a poly-bounded state space as well.) We therefore need to use different NP-Hardness proofs for our compact input specification.

THEOREM 4.6. For the special case of the problem with K = 2 states per arm and n arms, the following are true even when the functions f^i_k are piecewise linear with poly-size specification:

(1) Computing the optimal ergodic policy for MONOTONE bandits is NP-Hard.
(2) If the MONOTONE property on f is relaxed, then the problem becomes nε-hard to approximate unless P = NP.

PROOF. We reduce from the following periodic scheduling problem, which is shown to be NP-Complete in Bar-Noy et al. [1998]: Given n positive integers l_1, l_2, ..., l_n such that ∑_{i=1}^n 1/l_i ≤ 1, is there an infinite sequence of integers from {1, 2, ..., n} such that for every i ∈ {1, 2, ..., n}, all consecutive occurrences of i are exactly l_i elements apart? Given an instance of this problem, for each i ∈ {1, 2, ..., n}, we define an arm i with a "good" state g and a "bad" state w.

For part (1), for every arm i, let r^i_g = 1 and r^i_w = 0. Set q^i(g, w) = 1 and f^i_g(t) = 1 for all t. Moreover, set q^i(w, g) = 1 and f^i_w(t) = 0 if t ≤ 2l_i − 2, and 1 otherwise. Suppose for a moment that we only have arm i; then the optimal policy will play the arm exactly 2l_i − 1 steps after it is observed to be in w, and the arm will transition to state g. The policy will then play the arm in state g to obtain reward 1, and the arm will transition back to state w. Since this policy is periodic with period 2l_i, it yields long-term average reward exactly 1/(2l_i). It is easy to see that any other ergodic policy of playing this arm yields strictly smaller reward per step. Any policy of playing all the arms therefore has total reward at most ∑_{i=1}^n 1/(2l_i). But for any ergodic policy, the reward ∑_{i=1}^n 1/(2l_i) is achievable only if each arm i is played according to its individual optimal policy, which is twice in succession every 2l_i steps. But deciding whether this is possible is equivalent to solving the periodic scheduling problem on the l_i. Therefore, deciding whether the optimal policy to the MONOTONE bandit problem yields reward ∑_{i=1}^n 1/(2l_i) is NP-Hard.

For part (2), we make w a trapping state with no reward. For arm i, set q^i(g, w) = q^i(w, g) = 1; and f^i_g(l_i) = 0 and f^i_g(t) = 1 for all t ≠ l_i. Furthermore, f^i_w(t) = 0 for all t. Also set r^i_g = l_i and r^i_w = 0. Therefore, for any arm i, any policy will obtain reward from this arm if and only if it chooses the start state to be g and plays the arm periodically once every l_i steps, obtaining average reward 1. Therefore, approximating the value of the optimal policy is the same as approximating the size of the largest subset of {l_1, l_2, ..., l_n} that induces a periodic schedule. The NP-Hardness proof of periodic scheduling in Bar-Noy et al. [1998] shows that this problem is as hard as approximating the size of the largest subset of vertices in a graph whose induced subgraph is bipartite, which is nε-hard to approximate [Lund and Yannakakis 1993] unless P = NP.

In this proof, we showed that the problem becomes hard to approximate if the transition probabilities are nonmonotone. However, that does not address the question of how far we can push our technique. We give a negative result by showing that Whittle's LP can have arbitrarily large gap even if the MONOTONE bandit problem is only slightly generalized: the monotone nature of the transition probabilities is preserved, but the additional separable structure requiring them to be of the form f^i_k(t) q^i(k, j) is removed. In other words, the transition probability from state k to state j ≠ k if played after t steps is q^i_{kj}(t); these are arbitrary monotonically nondecreasing functions of t. We insist that ∑_{j≠k} q^i_{kj}(t) ≤ 1 for all k, t to ensure feasibility. We show that Whittle's LP has an Ω(n) gap for this generalization.

THEOREM 4.7. If the separability assumption on the transition probabilities is relaxed, Whittle's LP has an Ω(n) gap even with K = 3 states per arm.

PROOF. The arms are all identical. Each has 3 states, {g, b, a}. State a is an absorbing state with 0 reward. State g has reward 1, and state b has reward 0. The transition probabilities are as follows: q_{ab}(t) = q_{ag}(t) = 0. Further, q_{gb}(t) = 1/2, q_{ga}(1) = 0, and q_{ga}(t) = 1/2 for t ≥ 2. Finally, q_{ba}(t) = q_{bg}(t) = 0 for t < 2n − 1; q_{bg}(2n − 1) = 1/2; q_{ba}(2n − 1) = 0; and q_{ba}(t) = q_{bg}(t) = 1/2 for t ≥ 2n.

A feasible single-arm policy involves playing the arm in state b after exactly 2n − 1 steps (with probability 1/2, the state transitions to g), and continuously in state g (with probability 1/2, the state transitions to b). This policy never enters state a. The average rate of play is 1/n. The per-step reward of this policy is Ω(1/n). Whittle's LP chooses this policy for each arm, so that the total rate of play is 1 and the objective is Ω(1).

Now consider any feasible policy that plays at least 2 arms. If one of these arms is in state g, there is a nonzero probability that either this arm is played after t > 1 steps, or the other arm, in state b, is played after t ≥ 2n steps. In either case, with probability 1/2, the arm enters the absorbing state. Since this is an infinite horizon problem, this event happens with probability 1. Therefore, any feasible policy is restricted to playing only one arm in the long run, and obtains reward at most 1/n.

5. Extensions of MONOTONE Bandits

In this section, we consider several extensions of the MONOTONE bandit problem, incorporating multiple simultaneous plays of varying duration, and switching costs. We show that the same solution framework as in the previous section easily extends to these variants. Since the proofs parallel those in the previous section, we only outline the differences.

5.1. MULTIPLE SIMULTANEOUS PLAYS OF VARYING DURATION. We first extend the index policy for MONOTONE bandits to handle multiple plays of varying duration. We use the same problem description as in Section 4, except we assume there are M ≥ 1 players, each of which can play one arm every time step. (Therefore, M plays can proceed simultaneously per step.)


Furthermore, we assume that if arm i in state k ∈ Si is played, this play takes L^i_k ≥ 1 steps, and during this time the player cannot play another arm. We note that the values L^i_k are fixed beforehand, and the players are aware of these values. When a player plays arm i in state k, he/she is forced to remain on arm i for L^i_k steps, and he/she receives only one reward, of magnitude r^i_k, at the beginning of this "blocking" period.

Suppose that when the current play begins, the previous play ended t ≥ 1 steps ago. Then, at the end of the current play, the arm transitions to one of the states j ≠ k with probability q^i(k, j) f^i_k(t), and with the remaining probability stays in state k. In Section 4, we focused on the case where M = 1 and all L^i_k = 1.

Since the overall algorithm and analysis are very similar to those in Section 4, we simply outline the differences. First, Whittle's LP gets modified as follows:

Maximize ∑_{i=1}^n ∑_{k∈Si} ∑_{t≥1} r^i_k x^i_{kt}    (WHITTLE)

∑_{i=1}^n ∑_{k∈Si} ∑_{t≥1} L^i_k x^i_{kt} ≤ M
∑_{k∈Si} ∑_{t≥1} (t + L^i_k − 1) x^i_{kt} ≤ 1    ∀i
∑_{j∈Si, j≠k} ∑_{t≥1} x^i_{kt} q^i(k, j) f^i_k(t) = ∑_{j∈Si, j≠k} ∑_{t≥1} x^i_{jt} q^i(j, k) f^i_j(t)    ∀i, k ∈ Si
x^i_{kt} ≥ 0    ∀i, k ∈ Si, t ≥ 1

In the above formulation, the first constraint merely encodes that in expectation M arms are played per step. Note that each play of arm i in state k lasts L^i_k steps, and the play begins with probability x^i_{kt}, so that the steady-state probability that arm i in state k is being played at any time step is ∑_{t≥1} L^i_k x^i_{kt}. Note now that if the play begins after t steps, then the arm was idle for t − 1 steps before this event. Therefore, the quantity ∑_{t≥1} (t + L^i_k − 1) x^i_{kt} is the steady-state probability that arm i is in state k. This, summed over all k, must be at most 1 for any arm i. This is the second constraint. The final constraint encodes that the rate of leaving state k in steady state (left-hand side) must be the same as the rate of entering state k (right-hand side).

5.1.1. Balanced Program and Complementary Slackness. The balanced linear program is in Figure 10. (Recall the definition of ΔP^i_k(t) from Eq. (1).)

FIG. 10. The linear program (BALANCE) that we actually solve.

Next, Lemma 4.3 gets modified as follows:

LEMMA 5.1. In the optimal solution to (BALANCE), one of the following is true for every arm i with hi > 0: Either repeatedly playing the arm yields per-step reward at least λ + hi; or for every state k ∈ Si, there exists t ∈ W^i_k such that the following LP constraint is tight with equality.

L^i_k (λ + hi) + (t − 1) hi ≥ r^i_k + f^i_k(t) ΔP^i_k(t)    (5)

We next split the arms into two types:

Definition 10.

(1) Arm i ∈ U1 if repeatedly playing it yields average per-step reward at least λ + hi. (Our policy, described in the next section, favors these arms and continuously plays them.)
(2) Arm i ∈ U2 if i ∉ U1 and hi > 0. (Note that for i ∈ U2, for all k there exists t ∈ W^i_k that makes Inequality (5) tight.)

LEMMA 5.2. For any arm i ∈ U2 and state k ∈ Si, if ΔP^i_k(t) < 0, then:

L^i_k (λ + hi) = r^i_k + f^i_k(1) ΔP^i_k(t)

5.1.2. BALANCEDINDEX Policy

Definition 11. For each i ∈ U2 and state k ∈ Si, let t^i_k be the smallest value of t ∈ W^i_k for which Inequality (5) is tight. (By Lemma 5.1, t^i_k is well defined for every k ∈ Si.)

Definition 12. For arm i ∈ U2, partition the states Si into sets Gi and Ii as follows:

(1) k ∈ Gi if ΔP^i_k(t) < 0. (By Lemma 5.2, t^i_k = 1.) These states are termed good.
(2) k ∈ Ii if ΔP^i_k(t) ≥ 0. If arm i has been in state k ∈ Ii for fewer than t^i_k steps, it is defined to be not ready for play. Once it has waited t ≥ t^i_k steps, it becomes ready and can be played.

Finally, the BALANCEDINDEX policy is described in Figure 11. Note that any arm i ∈ U2 that is observed to be in a state in Gi is continuously played until its state transitions into Ii. This preserves the invariant that at most M − |U1| arms i ∈ U2 are in states k ∈ Gi at any time step.

FIG. 11. The INDEX policy.
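One step of this policy can be sketched as follows (Figure 11 is reproduced only as a caption above). The sketch is illustrative and deliberately simplified: it only chooses which arms the currently free players begin playing, and the bookkeeping of plays that are still blocking their players for L^i_k steps is assumed to be handled by the caller.

def multiplay_step(num_free, U1, U2, states, good, t_thresh, busy):
    # `busy` is the set of arms currently blocking a player; they cannot be re-chosen.
    chosen = []
    # Arms in U1 are (re)started whenever a player is free.
    for i in U1:
        if len(chosen) == num_free:
            return chosen
        if i not in busy:
            chosen.append(i)
    # Then arms of U2 sitting in good states, then ready arms.
    for phase in ('good', 'ready'):
        for i in U2:
            if len(chosen) == num_free:
                return chosen
            if i in busy or i in chosen:
                continue
            k, y = states[i]
            if phase == 'good' and k in good[i]:
                chosen.append(i)
            elif phase == 'ready' and k not in good[i] and y >= t_thresh[i][k]:
                chosen.append(i)
    return chosen    # possibly fewer than num_free arms; the remaining players idle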

5.1.3. Lyapunov Function Analysis. Define the potential for each arm in U2 at any time as follows:


Definition 13. If arm i ∈ U2 moved to state k ∈ Si some y steps ago (y ≥ 1 by definition), the potential is p^i_k + hi (min(y, t^i_k) − 1).

Therefore, whenever an arm i ∈ U2 enters state k, its potential is p^i_k. If k ∈ Ii, the potential then increases at rate hi for t^i_k − 1 steps, after which it remains fixed until a play completes for it. When our policy decides to play arm i ∈ U2, its current potential is p^i_k + hi (t^i_k − 1).

We finally complete the analysis in the following lemma. The proof crucially uses the "balance" property of the dual, which states that Mλ = ∑i hi ≥ OPT/2. Let ΦT denote the total potential at any step T, and let RT denote the total reward accrued until that step. Define the function LT = T · OPT/2 − RT − ΦT. Let ΔRT = RT+1 − RT and ΔΦT = ΦT+1 − ΦT.

LEMMA 5.3. LT is a Lyapunov function, that is, E[LT+1 | LT] ≤ LT. Equivalently, at any step:

E[ΔRT + ΔΦT | RT, ΦT] ≥ OPT/2

PROOF. Arms i ∈ U1 are played continuously and yield average per-step reward at least λ + hi, so that for any such arm i being played, E[ΔRT] ≥ λ + hi.

Next focus on arms i ∈ U2. As before, it is easy to show that when such an arm is played, regardless of whether k ∈ Gi or k ∈ Ii,

ΔRi + E[ΔΦi] = r^i_k + f^i_k(y) ∑_{j∈Si, j≠k} q^i(k, j)(p^i_j − p^i_k) − hi (t^i_k − 1)
             ≥ r^i_k + f^i_k(t^i_k) ΔP^i_k(t) − hi (t^i_k − 1) = L^i_k (λ + hi)

where the last equality follows from the definition of t^i_k (Definition 11). Since the play lasts L^i_k time steps, the amortized per-step change for the duration of the play, ΔRT + E[ΔΦT], is equal to λ + hi.

We finally bound the increase in reward plus potential at any time step. At step T, let Sg denote the arms in U1 together with the arms in U2 in states k ∈ Gi. For arms in U2 with states k ∈ Ii, let Sr denote the "ready" arms and let Sn denote the set of arms that are not "ready." There are two cases. If |Sg ∪ Sr| ≥ M, then some Sp ⊆ Sg ∪ Sr with |Sp| = M is being played, and

ΔRT + E[ΔΦT] ≥ ∑_{i∈Sp} (λ + hi) ≥ Mλ ≥ OPT/2

Next, if |Sg ∪ Sr| < M, then all these arms are being played, and

ΔRT + E[ΔΦT] ≥ ∑_{i∈Sg∪Sr} (λ + hi) + ∑_{i∈Sn} hi ≥ ∑_{i∈Sg∪Sr∪Sn} hi = ∑_i hi ≥ OPT/2

Since the potentials of the arms not being played do not decrease (since all hl > 0), the total change in reward plus potential is at least OPT/2.

THEOREM 5.4. The BALANCEDINDEX policy in Figure 11 is a 2-approximation for MONOTONE bandits with multiple simultaneous plays of variable duration.

5.2. SWITCHING COSTS. In several scenarios, playing an arm continuously incurs no extra cost, but switching to a different arm incurs a closing cost for the old arm and a setup cost for the new arm. For the applications mentioned in Section 1, in the context of UAV navigation [Ny et al. 2008], this is the cost of moving the UAV to the new location; in the case of wireless channel selection, this is the setup cost of transmitting on the new channel.

We now show a 2-approximation for MONOTONE bandits when the cost of switching out of arm i is ci and the cost of switching into arm i is si. This cost is subtracted from the reward. Note that the switching cost depends additively on the closing and setup costs of the old and new arms. The remaining formulation is the same as in Section 4.

Since the overall policy and proof are very similar to the version without these costs, we only outline the differences. First, we define the following variables: Let x^i_{kt} denote the probability of the event that arm i in state k is played after t steps and this arm was switched into from a different arm. Let y^i_{kt} denote the analogous probability when the previous play was for the same arm. The LP relaxation is as follows:

    Maximize Σ_{i=1}^n Σ_{k∈S_i} Σ_{t∈W^i_k} ( r^i_k (x^i_{kt} + y^i_{kt}) − (c_i + s_i) x^i_{kt} )        (LPSWITCH)

        Σ_{i=1}^n Σ_{k∈S_i} Σ_{t∈W^i_k} (x^i_{kt} + t y^i_{kt}) ≤ 1
        Σ_{k∈S_i} Σ_{t∈W^i_k} t (x^i_{kt} + y^i_{kt}) ≤ 1                                                  ∀i
        Σ_{j∈S_i, j≠k} Σ_{t∈W^i_k} (x^i_{kt} + y^i_{kt}) q^i(k, j) f^i_k(t)
            = Σ_{j∈S_i, j≠k} Σ_{t∈W^i_j} (x^i_{jt} + y^i_{jt}) q^i(j, k) f^i_j(t)                          ∀i, k
        x^i_{kt}, y^i_{kt} ≥ 0                                                                             ∀i, k, t

In this formulation, the only nonintuitive term is t y^i_{kt} in the first constraint. To check its validity, note that in this case the previous play was for the same arm and the system has to block for t steps until the next play, so that there are no plays in between. The balanced dual is the following. (Recall the definition of ΔP^i_k(t) from Eq. (1).)

    Minimize λ + Σ_{i=1}^n h_i                                          (DUALSWITCH)

        λ + t h_i ≥ r^i_k − c_i − s_i + f^i_k(t) ΔP^i_k(t)              ∀i, k, t
        t (λ + h_i) ≥ r^i_k + f^i_k(t) ΔP^i_k(t)                        ∀i, k, t
        λ = Σ_{i=1}^n h_i
        λ, h_i ≥ 0                                                      ∀i

The proof of the next claim follows from complementary slackness, exactly as the proof of Lemma 4.3.

LEMMA 5.5. In the optimal solution to (DUALSWITCH), one of the following is true for every arm i with h_i > 0: Either repeatedly playing the arm yields per-step reward at least λ + h_i; or for every state k ∈ S_i, there exists t ∈ W^i_k such that one of the following two LP constraints is tight with equality:

(1) λ + t h_i ≥ r^i_k − c_i − s_i + f^i_k(t) ΔP^i_k(t).
(2) t (λ + h_i) ≥ r^i_k + f^i_k(t) ΔP^i_k(t).


Only consider arms with h_i > 0. The next lemma is similar to Lemma 4.3:

LEMMA 5.6. For any arm i and state k ∈ S_i, if ΔP^i_k(t) < 0, then λ + h_i = r^i_k + f^i_k(1) ΔP^i_k(t).

For arm i, let t^i_k denote the smallest t for which some dual constraint for state k (refer to Lemma 5.5) is tight. The state k ∈ S_i belongs to G_i if the second constraint in Lemma 5.5 is tight at t = t^i_k, that is,

    t^i_k (λ + h_i) = r^i_k + f^i_k(t^i_k) ΔP^i_k(t)                    (6)

By Lemma 5.6, this includes the case where ΔP^i_k(t) < 0, so that t^i_k = 1. Otherwise, the first constraint in Lemma 5.5 is tight at t = t^i_k. This state k belongs to I_i, and becomes "ready" after t^i_k steps.

With these definitions, the BALANCEDINDEX policy is as follows: Stick with an arm i as long as its state is some k ∈ G_i, and play it after waiting t^i_k − 1 steps. Otherwise, play any "ready" arm. If no "good" or "ready" arm is available, then idle.
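The following Python sketch illustrates how a dual solution of (DUALSWITCH) would be turned into this play rule. It assumes the dual values λ, h_i and the quantities r^i_k, f^i_k(t), ΔP^i_k(t), c_i, s_i, W^i_k are available (e.g., from an LP solver); all identifiers are placeholders rather than the paper's pseudocode.

    # A sketch, under the stated assumptions, of the state classification of
    # Lemma 5.5 and of the resulting BALANCEDINDEX decision with switching costs.

    EPS = 1e-9

    def classify_state(i, k, lam, h, r, f, dP, c, s, W):
        """Return (t_ik, 'G' or 'I') for state k of arm i: t_ik is the smallest
        t in W[i][k] at which some dual constraint is tight; the state is in G_i
        if the "stay" constraint t*(lam+h) >= r + f*dP is the tight one, and in
        I_i otherwise (it becomes ready after t_ik steps)."""
        for t in sorted(W[i][k]):
            stay_tight = abs(t * (lam + h[i]) -
                             (r[i][k] + f[i][k](t) * dP[i][k](t))) <= EPS
            switch_tight = abs(lam + t * h[i] -
                               (r[i][k] - c[i] - s[i] + f[i][k](t) * dP[i][k](t))) <= EPS
            if stay_tight:
                return t, 'G'
            if switch_tight:
                return t, 'I'
        raise ValueError("no tight constraint; arm should have h[i] == 0")

    def choose_play(current_arm, current_state, waited, labels, t_index, ready_arms):
        """One decision: stick with the current arm while its state is good
        (waiting t_ik - 1 steps before each play); otherwise play any ready
        arm; otherwise idle."""
        i, k = current_arm, current_state
        if labels[i][k] == 'G':
            return ('play', i) if waited >= t_index[i][k] - 1 else ('wait', i)
        if ready_arms:
            return ('play', ready_arms[0])
        return ('idle', None)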

THEOREM 5.7. The BALANCEDINDEX policy is a 2-approximation for MONOTONE bandits with switching costs.

PROOF. The definitions of the potentials and the proof are the same as for Lemma 4.4. The only difference is that the potential of a state k ∈ G_i is defined to be fixed at p^i_k. Whenever the player sticks to arm i in state k ∈ G_i and plays it after waiting t^i_k − 1 steps, the reward plus change in potential, amortized over the t^i_k steps (waiting plus playing), is exactly λ + h_i by Eq. (6). The rest of the proof is the same as before.

6. FEEDBACK MAB with Observation Costs

In wireless channel scheduling, the state of a channel can be accurately determined by sending probe packets that consume energy. However, data transmission at a high bit-rate yields only delayed feedback about channel quality. This aspect can be modeled by decoupling observation of the state of an arm, via probing, from the process of utilizing or playing the arm to gather reward (data transmission). We model this as a variant of the FEEDBACK MAB problem where, at any step, M arms can be played without observing their states, and the reward of the underlying state is deposited in a bank. Further, any arm can be probed by paying a cost to determine its underlying state, and multiple such probes are allowed per step. The goal is to maximize the difference between the time-average reward and the probing cost. A version of the probe problem was first proposed in a preliminary draft of Guha et al. [2008].

Formally, we consider the following variant of the FEEDBACK MAB problem. As before, the underlying 2-state Markov chain (on states {g, b}) corresponding to an arm evolves irrespective of whether the arm is played or not. When arm i is played, a reward of r_i or 0 (depending on whether the underlying state is g or b, respectively) is deposited into a bank. Unlike the FEEDBACK MAB problem, the player does not get to know the reward value or the state of the arm. However, at the end of any time step, the player can probe any arm i by paying cost c_i to observe its underlying state. We assume that the probes are at the end of a time step, and the state evolves between the probe and the start of the next time step.


More than one arm can be probed and observed in any time step, but at most M arms can be played, and the plays are of unit duration. The goal, as before, is to maximize the infinite-horizon time-average difference between the reward obtained from playing the arms and the probing cost spent. We refer to the difference between the reward and the probing cost as the "value" of the policy.

Though the probe version is not a MONOTONE bandit problem, we show that the above techniques can indeed be used to construct a policy that yields a (2 + ε)-approximation for any fixed ε > 0.

6.1. LP FORMULATION. Let OPT denote the value of the optimal policy. The following is an LP relaxation for the optimal policy. Let x^i_{gt} (respectively x^i_{bt}) denote the probability that arm i was last observed to be in state g (respectively b) t time steps ago and is played at the current time step. Let z^i_{gt} (respectively z^i_{bt}) denote the probability that arm i was last observed to be in state g (respectively b) t steps ago and is probed at the current time step. The probes are at the end of a time step, and the state evolves between the probe and the start of the next time step. The LP formulation is as follows; as before, the LP can be solved up to a (1 + ε) factor.

    Maximize Σ_{i=1}^n Σ_{t≥1} ( r_i (u_{it} x^i_{gt} + v_{it} x^i_{bt}) − c_i (z^i_{gt} + z^i_{bt}) )

        Σ_{i=1}^n Σ_{t≥1} (x^i_{gt} + x^i_{bt}) ≤ M
        Σ_{t≥1} t (z^i_{gt} + z^i_{bt}) ≤ 1                             ∀i
        x^i_{st} ≤ Σ_{l≥t} z^i_{sl}                                     ∀i, t ≥ 1, s ∈ {g, b}
        Σ_{t≥1} (1 − u_{it}) z^i_{gt} = Σ_{t≥1} v_{it} z^i_{bt}         ∀i
        x^i_{st}, z^i_{st} ≥ 0                                          ∀i, t ≥ 1, s ∈ {g, b}
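The coefficients u_{it} and v_{it} in this LP are the usual 2-state belief probabilities. The following Python helper, assuming the per-arm parameters α, β, r are given, computes them together with the reward coefficients R^i_{gt}, R^i_{bt} for a finite truncation of the horizon; it is a convenience sketch, not part of the algorithm.

    # Belief probabilities for a 2-state chain with alpha = P[b -> g] and
    # beta = P[g -> b]: u_t = P[g now | observed g t steps ago],
    # v_t = P[g now | observed b t steps ago].

    def u_t(alpha, beta, t):
        stat = alpha / (alpha + beta)
        return stat + (1 - stat) * (1 - alpha - beta) ** t

    def v_t(alpha, beta, t):
        stat = alpha / (alpha + beta)
        return stat * (1 - (1 - alpha - beta) ** t)

    def reward_coefficients(r, alpha, beta, T):
        """Truncated coefficients R_{gt} = r*u_t and R_{bt} = r*v_t for
        t = 1..T, as one would use in a finite-horizon truncation of the LP."""
        Rg = [r * u_t(alpha, beta, t) for t in range(1, T + 1)]
        Rb = [r * v_t(alpha, beta, t) for t in range(1, T + 1)]
        return Rg, Rb

    if __name__ == "__main__":
        Rg, Rb = reward_coefficients(r=1.0, alpha=0.2, beta=0.1, T=5)
        print(Rg)   # decreasing toward r*alpha/(alpha+beta)
        print(Rb)   # increasing toward r*alpha/(alpha+beta)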

The dual assigns a variable φ^i_{st} ≥ 0 for each arm i, state s ∈ {g, b}, and last-observed time t ≥ 1. It further assigns variables h_i, p_i ≥ 0 per arm i, and λ ≥ 0 globally. Let R^i_{st} be the expected reward of playing arm i in state s when the last-observed time is t (so R^i_{gt} = r_i u_{it} and R^i_{bt} = r_i v_{it}). The balanced dual is as follows:

    Minimize Mλ + Σ_i h_i

        λ + φ^i_{st} ≥ R^i_{st}                                         ∀i, t ≥ 1, s ∈ {g, b}
        t h_i ≥ −c_i − (1 − u_{it}) p_i + Σ_{l≤t} φ^i_{gl}              ∀i, t ≥ 1
        t h_i ≥ −c_i + v_{it} p_i + Σ_{l≤t} φ^i_{bl}                    ∀i, t ≥ 1
        Mλ = Σ_i h_i
        λ, h_i, p_i, φ^i_{st} ≥ 0                                       ∀i, s ∈ {g, b}

We omit explicitly writing the corresponding primal. Note now that in the dual optimal solution, φ^i_{st} = max(0, R^i_{st} − λ) for s ∈ {g, b}. (This is the smallest value of φ^i_{st} satisfying the first constraint, and whenever we reduce φ^i_{st}, we preserve the latter constraints while possibly reducing h_i.) Moreover, we have the following complementary slackness conditions:

(1) h_i > 0 ⇒ Σ_{t≥1} t (z^i_{gt} + z^i_{bt}) = 1 + ω > 0.


(2) z^i_{gt} > 0 ⇒ t h_i = −c_i − (1 − u_{it}) p_i + Σ_{l≤t} φ^i_{gl}.
(3) z^i_{bt} > 0 ⇒ t h_i = −c_i + v_{it} p_i + Σ_{l≤t} φ^i_{bl}.

LEMMA 6.1. Focus only on arms for which h_i > 0. For these arms, we have the following.

(1) For at least one t ≥ 1, z^i_{gt} > 0, and similarly, for some (possibly different) t, z^i_{bt} > 0.

(2) Let d_i = min{t ≥ 1 : z^i_{bt} > 0}; then d_i h_i = −c_i + v_{i d_i} p_i + Σ_{l≤d_i} φ^i_{bl}. Further, define m_i = |{l ≤ d_i : φ^i_{bl} > 0}|; then φ^i_{bl} > 0 for d_i − m_i + 1 ≤ l ≤ d_i and φ^i_{bl} = 0 for l ≤ d_i − m_i.

(3) Let e_i = min{t ≥ 1 : z^i_{gt} > 0}. Then, for all t ≤ e_i, λ + φ^i_{gt} = R^i_{gt}. Moreover, e_i (λ + h_i) = Σ_{t≤e_i} R^i_{gt} − c_i − (1 − u_{i e_i}) p_i.

PROOF. For part (1), by complementary slackness and using h_i > 0, we have Σ_{t≥1} t (z^i_{gt} + z^i_{bt}) > 0. But if z^i_{gt} > 0 for some t, then by Σ_{t≥1} (1 − u_{it}) z^i_{gt} = Σ_{t≥1} v_{it} z^i_{bt}, we have z^i_{bt} > 0 for some (possibly different) t. The reverse holds as well.

Part (2) follows by complementary slackness on z^i_{b d_i} > 0. The second statement follows from the fact that φ^i_{bl} is nondecreasing in l, since R^i_{bl} is nondecreasing in l.

For part (3), since z^i_{g e_i} > 0, by complementary slackness, e_i h_i = −c_i − (1 − u_{i e_i}) p_i + Σ_{t≤e_i} φ^i_{gt}. Note that φ^i_{gt} = max(0, R^i_{gt} − λ). If e_i = 1, then since the left-hand side is positive, it must be that φ^i_{g1} > 0, which implies that φ^i_{g1} = R^i_{g1} − λ. If e_i > 1, then we subtract (e_i − 1) h_i ≥ −c_i − (1 − u_{i(e_i−1)}) p_i + Σ_{t≤e_i−1} φ^i_{gt} from the equality and get h_i ≤ (u_{i e_i} − u_{i(e_i−1)}) p_i + φ^i_{g e_i}. The left-hand side is positive and the first term of the right-hand side is negative, so φ^i_{g e_i} > 0. Since φ^i_{gt} is nonincreasing in t by the above formula, φ^i_{gt} > 0 for all t ≤ e_i. This in turn implies that φ^i_{gt} = R^i_{gt} − λ for all t ≤ e_i. Substituting this back into the equality yields the second result.

6.2. INDEX POLICY. Let S be the set of arms with h_i > 0; we ignore all arms except those in S. The policy uses the parameters e_i, d_i, and m_i defined in Lemma 6.1. If arm i was observed to be in state b, we call it "not ready" for the next d_i − m_i steps, and "ready" at the end of the (d_i − m_i)-th step.
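The following Python sketch shows the per-arm timing implied by the parameters e_i, d_i, m_i of Lemma 6.1. The actual policy in Figure 12 additionally coordinates at most M plays across arms per step; that scheduling layer is omitted here, and the names are illustrative.

    # A sketch of the two-stage structure described in the proof of Theorem 6.2
    # for a single arm.  The m_i = 0 corner case (a probe with no play) is
    # handled explicitly.

    def next_action(last_obs, step, e_i, d_i, m_i):
        """Action for one arm at the step-th time step after its last probe
        (step = 1, 2, ...).  Returns 'wait', 'probe', 'play', or 'play+probe'."""
        if last_obs == 'b':
            # "not ready" for d_i - m_i steps, then played for m_i steps
            # (Stage 1); the next probe comes at the end of step d_i
            if step <= d_i - m_i:
                return 'probe' if (m_i == 0 and step == d_i) else 'wait'
            return 'play+probe' if step == d_i else 'play'
        # last observed good: exploit for e_i steps (Stage 2), probe at step e_i
        return 'play+probe' if step == e_i else 'play'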

THEOREM 6.2. The policy in Figure 12 is a (2 + ε)-approximation to FEEDBACK MAB with observation costs.

FIG. 12. The policy for FEEDBACK MAB with observations.

PROOF. Let OPT denote the (1 + ε)-approximate LP solution. Recall that φ^i_{st} = max(0, R^i_{st} − λ) for s ∈ {g, b}.

Define the following potentials for each arm i. If it was last observed to be in state b some t steps ago, define its potential to be min(t, d_i − m_i) · h_i; if it was last observed in state g, define its potential to be p_i. We show that the time-average expected value (reward minus cost) plus change in potential per step is at least min(Mλ, Σ_{i∈S} h_i) ≥ OPT/2. Since the potentials are bounded, this proves a 2-approximation.

Each ready arm in Stage 1 is played for m_i steps and probed at the end of the m_i-th step. Suppose that the arm was last observed to be in state b some t steps ago. The total expected value is −c_i + Σ_{l=t}^{t+m_i−1} R^i_{bl}, which is at least −c_i + Σ_{l=d_i−m_i+1}^{d_i} R^i_{bl}, since R^i_{bl} = r_i v_{il} is nondecreasing in l. The expected change in potential is v_{i(t+m_i−1)} p_i − (d_i − m_i) h_i, since the arm loses the potential build-up of (d_i − m_i) h_i accrued while it was not ready, and has probability v_{i(t+m_i−1)} of becoming good. This is at least v_{i d_i} p_i − (d_i − m_i) h_i since, by definition, t ≥ d_i − m_i + 1. After m_i steps, the total expected value plus change in potential is therefore at least

    −c_i + Σ_{l=d_i−m_i+1}^{d_i} R^i_{bl} + v_{i d_i} p_i − (d_i − m_i) h_i ≥ m_i h_i + Σ_{l=d_i−m_i+1}^{d_i} (R^i_{bl} − φ^i_{bl}).

The inequality follows by Lemma 6.1, Part (2). Since R^i_{bl} − φ^i_{bl} = λ for d_i − m_i + 1 ≤ l ≤ d_i, the total expected change in value plus potential is m_i (λ + h_i). Thus, the average per step for the duration of the plays is at least λ + h_i. (This argument also shows that if m_i = 0, then the probing on the previous step does not decrease the potential.)

Similarly, each arm i in Stage 2 was probed and found to be good, so it is exploited for e_i steps and probed at the end of the e_i-th step. During these e_i steps, the total expected value is Σ_{t≤e_i} R^i_{gt} − c_i, and the expected change in potential is −(1 − u_{i e_i}) p_i, since the arm has probability (1 − u_{i e_i}) of being in the bad state at the end. By Lemma 6.1, Part (3), the total expected value plus change in potential is Σ_{t≤e_i} R^i_{gt} − c_i − (1 − u_{i e_i}) p_i = e_i (λ + h_i), so the average change per step is λ + h_i.

Now, if M arms are currently in Stage 1 or 2, then the total value plus change in potential for these arms is at least Mλ ≥ OPT/2. If fewer than M arms are in those stages, then every arm i that is not in Stage 1 or Stage 2 is in state b and not "ready." Thus, its change in potential is h_i. Moreover, for every arm j that is in Stage 1 or Stage 2, we also get a contribution of at least λ + h_j ≥ h_j. Summing, we get an expected value plus change in potential of at least Σ_{i∈S} h_i ≥ OPT/2, which completes the proof.

7. Nonpreemptive Machine Replenishment

Finally, we show that our technique of balancing provides a 2-approximation for an unrelated, yet classic, restless bandit problem [Bertsekas 2001; Goseva-Popstojanova and Trivedi 2000; Munagala and Shi 2008]: modeling the breakdown and repair of machines. Interestingly, we also show that the Whittle index policy is an arbitrarily poor approximation for nonpreemptive machine replenishment, and thus the technique we suggest can be significantly stronger than Whittle index policies.


There are n independent machines whose performance degrades with time in a Markovian fashion. At any step, any machine can be moved to a repair queue by paying a cost. The repair process is nonpreemptive, Markovian, and can work on at most M machines per time step. A scheduling policy decides when to move a machine to the repair queue and which machine to repair at any time slot. The goal is to find a scheduling policy that maximizes the time-average difference between rewards and repair costs. Note that if an arm is viewed as a machine, playing it corresponds to repairing it, and does not yield reward. In that sense, this problem is like an inverse of the MONOTONE bandits problem.

Formally, there are n machines. Let S_i denote the set of active states for machine i. If the state of machine i is u ∈ S_i at the beginning of time t, the state evolves into v ∈ S_i at time t + 1 with probability p_{uv}. The state transitions of different machines, when they are active, are independent. If the state of machine i is u ∈ S_i during a time step, it accrues reward r_u ≥ 0. We assume each S_i is of polynomial size.

At any time instant, machine i in state u ∈ S_i can be scheduled for maintenance by moving it to the repair queue, starting with the next time slot, by paying cost c_u. The maintenance process for machine i takes time distributed as Geometric(s_i), independently of the other machines. Therefore, if the repair process works on machine i at any time step, this repair completes after that time step with probability s_i. While the machine is in the repair queue, it yields no reward, and we denote its state by κ_i. The maintenance process is nonpreemptive, and the server can maintain at most M machines at any time. When a repair completes, machine i returns to its "initial active state" ρ_i ∈ S_i at the beginning of the next time slot. The goal is to design a scheduling policy so that the time-average reward minus maintenance cost is maximized.
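For concreteness, the following Python sketch simulates one time step of the model just described (nonpreemptive repairs, at most M machines repaired per step, Geometric(s_i) repair times). The data structures are placeholders chosen for the sketch, not part of the paper's algorithm.

    import random

    def step(state, repair_queue, to_repair, P, reward, cost, s, rho, M):
        """Advance the system by one time step; returns reward minus repair cost.

        state[i]        current active state of machine i, or None if in repair
        repair_queue    FIFO list of machines waiting for / under repair
        to_repair[i]    True if the policy moves machine i to the queue this step
        """
        under_repair = repair_queue[:M]           # nonpreemptive: head of the queue
        value = 0.0
        for i, u in enumerate(state):
            if u is None:
                continue                          # machine is in the repair queue
            value += reward[i][u]                 # reward r_u for the active state
            if to_repair[i]:
                value -= cost[i][u]               # pay c_u; repair starts next slot
                state[i] = None
                repair_queue.append(i)
            else:                                 # Markovian transition within S_i
                nxt = list(P[i][u])
                state[i] = random.choices(nxt, [P[i][u][v] for v in nxt])[0]
        for i in under_repair:                    # each repair finishes w.p. s_i
            if random.random() < s[i]:
                repair_queue.remove(i)
                state[i] = rho[i]                 # back to the initial active state
        return value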

We now show that our duality-based technique yields a 2-approximation policy with general S_i, M, and nonpreemptive repairs.^5

7.1. LP FORMULATION AND DUAL. We now present an LP bound on the optimal policy. For any policy, let x_u denote the steady-state probability that machine i is in state u during a time step, and let z_u denote the steady-state probability that machine i transitions from state u ∈ S_i to state κ_i. We assume the policy moves a machine to the repair queue at the beginning of a time slot, and that repairs finish at the end of a time slot. Note that it does not make sense to repair a machine in its initial state ρ_i, so z_{ρ_i} = 0.

    Maximize Σ_i Σ_{u∈S_i} (r_u x_u − c_u z_u)

        Σ_i x_{κ_i} ≤ M
        x_{κ_i} + Σ_{u∈S_i} x_u ≤ 1                                                     ∀i
        Σ_{v∈S_i, v≠u} x_v p_{vu} = z_u + Σ_{v∈S_i, v≠u} x_u p_{uv}                     ∀i, u ∈ S_i \ {ρ_i}
        s_i x_{κ_i} + Σ_{v∈S_i, v≠ρ_i} x_v p_{vρ_i} = Σ_{v∈S_i, v≠ρ_i} x_{ρ_i} p_{ρ_i v}   ∀i
        z_u, x_u ≥ 0                                                                    ∀i, u ∈ S_i ∪ {κ_i}

^5 Note that our result holds for S_i being any Markov chain; for the machine replenishment problem, it will typically be the case that the Markov chain is a DAG rooted at ρ_i.


FIG. 13. The repair policy from our LP-duality approach.

The dual of the above LP assigns potentials φ_u for each state u ∈ S_i. Further, it assigns a value h_i ≥ 0 for each machine i, and a global variable λ ≥ 0. We directly write the balanced dual:

    Minimize Mλ + Σ_i h_i

        λ + h_i ≥ s_i φ_{ρ_i}                                   ∀i
        h_i ≥ r_u + Σ_{v∈S_i} p_{uv} (φ_v − φ_u)                ∀i, u ∈ S_i
        φ_u + c_u ≥ 0                                           ∀i, u ∈ S_i
        Mλ = Σ_i h_i
        λ, h_i ≥ 0                                              ∀i

Note that Mλ = Σ_i h_i ≥ OPT/2. We omit explicitly writing the corresponding primal formulation. Now focus only on machines for which h_i > 0. We have the following complementary slackness conditions:

(1) h_i > 0 ⇒ x_{κ_i} + Σ_{u∈S_i} x_u = 1 − ω > 0.
(2) x_u > 0 ⇒ h_i = r_u + Σ_{v∈S_i} p_{uv} (φ_v − φ_u).
(3) z_u > 0 ⇒ φ_u + c_u = 0.
(4) x_{κ_i} > 0 ⇒ λ + h_i = s_i φ_{ρ_i}.

7.2. INDEX POLICY AND ANALYSIS

CLAIM 7.1. Consider only machines with h_i > 0. There are two cases:

(1) For machines in which z_v > 0 for some v, we have x_{κ_i} > 0, so that the policy can only reach states u ∈ S_i in which x_u + z_u > 0.

(2) For machines in which z_v = 0 for all v, we have x_{κ_i} = 0. The policy will never repair the machine, and after a finite number of steps, the machine will only visit states u ∈ S_i for which x_u > 0.

PROOF. Adding the third and fourth constraints of the primal yields s_i x_{κ_i} = Σ_{u∈S_i} z_u. If z_v > 0 for some v, then x_{κ_i} > 0, which by the fourth constraint in the primal implies that x_{ρ_i} > 0. Now, suppose that x_v > 0; then for every state u such that p_{vu} > 0, the third constraint in the primal implies that z_u + x_u > 0. If z_u > 0, then the policy will stop at state u and enter machine i into the repair queue. If z_u = 0, then it must be that x_u > 0. Repeatedly using this argument starting at v = ρ_i, we see that the policy only visits states with x_u + z_u > 0, not going beyond the first state where z_u > 0.

For machines in which z_v = 0 for all v, conditions (3) and (4) in the primal imply that {x_v} are the steady-state probabilities of a Markov chain with transition matrix [p_{uv}]. Therefore, after a finite number of steps, the machine will only go to states u ∈ S_i for which x_u > 0.

THEOREM 7.2. The policy in Figure 13 is a 2-approximation for nonpreemptive machine replenishment.

PROOF. We interpret φ_u as the potential of state u ∈ S_i, and let the potential of state κ_i be 0. We show that in each step, the expected reward plus change in potential is at least OPT/2.

First, when any active machine i enters a state u with z_u > 0, the machine is moved to the repair queue by paying cost c_u. The potential change is −φ_u, so the sum of the cost and the potential change is −c_u − φ_u, which is 0 by complementary slackness. Therefore, moving a machine to the repair queue does not alter the reward plus potential.

Next, let S_r denote the set of machines in the repair queue, and let S_w ⊆ S_r denote the subset of these machines being repaired at the current time. Note that if |S_r| < M, then S_w = S_r; otherwise, |S_w| = M. For each machine i ∈ S_w, the repair finishes with probability s_i, and the machine's potential then changes by φ_{ρ_i}. Therefore, the expected change in potential per step is s_i φ_{ρ_i} = λ + h_i by complementary slackness.

Suppose first that |S_w| = M. Then the net reward plus change in potential is at least Σ_{i∈S_w} (λ + h_i) ≥ Mλ ≥ OPT/2. Suppose instead that |S_w| < M; then we must have S_w = S_r. Note that any machine that enters a state u with z_u > 0 is automatically moved to S_r at the beginning of the time step. Using this along with Claim 7.1, we have that for all but finitely many steps, all machines i ∉ S_r are in states u with x_u > 0. (Since we care about the infinite-horizon average reward, the finite number of steps does not matter.) The reward plus change in potential for machine i ∉ S_r is r_u + Σ_{v∈S_i} p_{uv} (φ_v − φ_u) = h_i by complementary slackness. Therefore, the total reward plus change in potential is

    Σ_{i∈S_r} (λ + h_i) + Σ_{i∉S_r} h_i ≥ Σ_i h_i ≥ OPT/2.

Since the potentials are bounded, the policy is a 2-approximation.

7.3. GAP OF THE WHITTLE INDEX. We now show that the Whittle index policy is an arbitrarily poor approximation for nonpreemptive machine replenishment. Note that in the setting considered below, Whittle's index is a 1.51-approximation when repairs can be preempted [Munagala and Shi 2008]. However, when no preemption is allowed, the policy can perform arbitrarily poorly.

THEOREM 7.3. The Whittle index policy is an arbitrarily poor approximation for nonpreemptive machine replenishment, even with 2 machines and M = 1 repair per step.

PROOF. Suppose M = 1, there are two machines {1, 2}, and S_i = {ρ_i, b_i} for machines i ∈ {1, 2}. Let r_{ρ_i} = r_i and r_{b_i} = 0, so that machine i is either "active" (state ρ_i) or "broken" (state b_i). Let p_i denote the probability of transitioning from state ρ_i to b_i. Assume c_i = 0. Note that playing a machine corresponds to moving it to the repair queue.

The Whittle index of a state is the largest penalty that can be charged per maintenance step so that the optimal single-machine policy still schedules the machine for maintenance on entering that state. In the 2-state machines just described, the Whittle index of state ρ_i is negative, since even with penalty zero per repair step, the policy will not schedule the machine for maintenance in the good state.


The Whittle index of state b_i is η_i = s_i r_i / p_i, since for this value of the penalty, the expected reward of r_i / p_i per renewal equals the expected penalty of η_i / s_i paid for maintenance during the renewal period.

Suppose s_1 = 1/n^4, s_2 = 1, p_1 = 1/n, p_2 = 1, and r_1 = r_2 = 1. If used by itself, machine 1 yields reward r_1 s_1/(s_1 + p_1) ≈ 1/n^3 and machine 2 yields reward r_2 s_2/(s_2 + p_2) = 1/2. Any reasonable policy will therefore only maintain machine 2 and ignore machine 1. However, in the Whittle index policy, when machine 1 is broken and machine 2 is active, the policy decides to maintain machine 1 (since the Whittle index η_1 of b_1 is positive and that of ρ_2 is negative). In this case, machine 1 is scheduled for repair. This repair takes O(n^4) time steps and cannot be interrupted. Moreover, since machine 2 is broken at least half the time, this "blocking" by machine 1 happens at a constant rate, so in the long run machine 2 is almost always broken and the Whittle index policy obtains reward O(1/n^3), while the optimal policy obtains reward r_2 · 1/(1 + 1) = 1/2 by only maintaining machine 2.
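The following short Python check reproduces the numbers in this example; the renewal formula r_i s_i/(s_i + p_i) for a machine maintained by itself is the one used above.

    # Numeric check of the gap instance in the proof of Theorem 7.3: a machine
    # maintained alone alternates ~1/p_i rewarding steps with ~1/s_i repair
    # steps, so its long-run average reward is r_i * s_i / (s_i + p_i).

    def standalone_reward(r, s, p):
        return r * s / (s + p)

    if __name__ == "__main__":
        n = 100
        s1, s2 = 1.0 / n**4, 1.0
        p1, p2 = 1.0 / n, 1.0
        r1 = r2 = 1.0
        print(standalone_reward(r1, s1, p1))   # ~ 1/n^3: machine 1 alone is tiny
        print(standalone_reward(r2, s2, p2))   # 0.5: the optimal policy's reward
        # Whittle index of the broken state: eta_i = s_i * r_i / p_i > 0, while
        # the active state has negative index, so a broken machine 1 is repaired
        # even when machine 2 is active, blocking the server for ~1/s1 = n^4 steps.
        print(s1 * r1 / p1, s2 * r2 / p2)      # eta_1, eta_2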

8. Open Questions

Our work throws open interesting research avenues. First, can our algorithmic techniques be extended to other subclasses of restless bandits, for instance, the POMDP problem obtained by generalizing FEEDBACK MAB to K > 2 states per arm? Note that unlike the K = 2 case considered here, the transition probability values are no longer monotone, as they are based on an underlying Markov chain. Next, can matching hardness results be shown for these problems, particularly FEEDBACK MAB? Finally, our analysis effectively uses piecewise-linear Lyapunov functions. Such functions derived from LP relaxations have also been used by Bertsimas, Gamarnik, and Tsitsiklis [Bertsimas et al. 2002] to show stability in multi-class queueing systems. Though the techniques and results in that work are very different from ours, it would be interesting to explore whether our techniques extend to multi-class queueing problems.

Appendix

A. Omitted Proofs

The following proofs are deferred to this appendix because they are independent of our duality-based technique and because, given their length, we fear they might detract from the article's flow.

A.1. PROOF OF THEOREM 2.1. We show examples in which the myopic policy and the optimal index policy exhibit the desired gaps against the optimum.

Gap of the Myopic Policy. We first show an instance where the myopic policy, which plays the arm with the highest expected next-step reward, has gap Ω(n) with respect to the reward of the optimal policy. In this instance, there is one "type 1" deterministic arm with reward r_1 = 1. There are n independent and identically distributed "type 2" arms with r_2 = n, β = 1/(2n), and α/(α + β) = 1/n.

First consider the myopic policy. Any policy encounters an instant where all the type 2 arms are in state b. In this case, the myopic next-step reward of any of these arms is r_2 v_t < r_2 · α/(α + β) = 1, so that from then on the myopic policy always plays the type 1 arm, yielding long-term reward 1.

TABLE I. DESCRIPTION OF THE OPTIMAL POLICY WHEN BOTH ARMS 2, 3 WERE LAST OBSERVED TO BE b

    State (k1, k2)        Play Arm
    k1 ≤ 3, k2 < k1           1
    k1 = 4, k2 ≤ 2            1
    k1 = 4, k2 = 3            2
    k1 = 4, k2 > 4            3
    k1 ≥ 5, k2 < k1           2

Note that the policy is symmetric with respect to arms 2, 3, and furthermore, k1 ≠ k2.

Next consider a policy that ignores the type 1 arm and is myopic over the type 2 arms. Such a policy performs round-robin over these arms when it observes all of them to be in state b. In this case, the probability that the arm it plays is in state g is at least v_n ≥ 1/(2n). Therefore, the reward of this policy dominates that of the following 2-state Markov chain: the two states are h and l; state h yields reward n, and state l yields reward 0; the transition probabilities from h to l and vice versa are 1/(2n). The long-term reward is therefore at least n/2, which lower-bounds the reward of the optimal policy.
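A quick numeric check of the bound just used: the 2-state chain with symmetric transition probability 1/(2n) has stationary probability 1/2 in the rewarding state, giving average reward n/2. The Python snippet below is illustrative only.

    # Long-run average reward of a 2-state chain: the stationary probability of
    # state h is q_lh / (q_hl + q_lh).

    def two_state_average_reward(reward_h, q_hl, q_lh):
        return reward_h * q_lh / (q_hl + q_lh)

    if __name__ == "__main__":
        n = 50
        q = 1.0 / (2 * n)
        print(two_state_average_reward(reward_h=n, q_hl=q, q_lh=q))   # n/2 = 25.0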

Nonoptimality of Index Policies. We now show an instance where there is a constant-factor gap between the optimal policy and the optimal index policy. The example has 3 arms. Arm 1 is deterministic with reward r_1 = 1. Arms 2 and 3 are i.i.d. with α = β = 0.1 and reward r_2 = 2 in state g.

We compute the optimal policy by value iteration [Bertsekas 2001] using a discount factor of γ = 0.99 (to ensure that the dynamic program converges). The optimal policy always plays arm 2 or 3 if either was just observed in state g. The decisions are complicated only if both arms 2 and 3 were last observed in state b. In this case, we can compactly represent the current state by the pair (k1, k2) ∈ Z+ × Z+, representing the number of time steps ago that arms 2 and 3, respectively, were observed in state b. For such a state, the policy either plays arm 1, or plays arm 2 or 3 depending on whether k1 > k2 or not. Such a policy is therefore completely characterized by the region D in the (k1, k2) plane where the decision is to play arm 1; in the remaining region, it plays arm 2 or 3 depending on whether k1 > k2 or not. For the optimal policy, we have:

    D* = {(k1, k2) ∈ Z+ × Z+ | k1 ≤ 4, k2 ≤ 4, k1 + k2 ≤ 6}

In other words, the description of the optimal policy is as given in Table I (note that it is symmetric with respect to arms 2 and 3).

Note the following non-index behavior, where, given the states of arms 1 and 2, the decision to play switches between these arms depending on the state of arm 3. If arm 2 was observed to be b some four steps ago, then: (i) if arm 3 was b some two steps ago, the policy plays arm 1; (ii) if arm 3 was b some three steps ago, the policy plays arm 2. To compute the reward of this policy, we observe that it has an equivalent description as a Markov chain over 6 states (these new states correspond to groups of states in the original process). A closed-form evaluation of this chain shows that the reward of the optimal policy is 1.46218.

We next perform this evaluation for the nearby index policies. Note that for any index policy, the region D has to be an axis-parallel square. The first is the policy where the decision for k1 = 4, k2 ≤ 2 is to play arm 2, so that D = {(k1, k2) | k1, k2 ≤ 3}.


FIG. 14. Behavior of the function F(λ, t).

This policy evaluates to an average reward of 1.46104. The next is the policy where the decision for k1 = 4, k2 = 3 is to play arm 1, so that D = {(k1, k2) | k1, k2 ≤ 4}. This policy has reward 1.46167. Other index policies have only worse reward. This implies that there is a constant-factor gap between the optimal policy and the best index policy.

A.2. PROOF OF LEMMA 2.5. Before proving the lemma, we introduce some notation. Since we focus on a particular arm, we drop the subscript corresponding to the arm.

Definition 14. For policy P(t), let R(t) denote the expected per-step reward, and let Q(t) denote the expected rate of play. Let F(λ, t) = R(t) − λQ(t) denote the value of P(t). Also define:

    t(λ) = argmax_{t≥1} F(λ, t) = argmax_{t≥1} (R(t) − λQ(t))                   (7)

Finally, let R(λ) = R(t(λ)) and Q(λ) = Q(t(λ)) [Figure 14].

Note that this definition implies L(λ) = P(t(λ)), so that H(λ) = max_{t≥1} (R(t) − λQ(t)) = R(λ) − λQ(λ). Since each P(t) corresponds to a Markov chain, it is straightforward to obtain closed-form expressions for R(t) and Q(t).

LEMMA A.1. When playing an arm with reward r and transition probabilities α and β, the policy P(t) yields average reward R(t) = r v_t / (v_t + tβ) and expected rate of play Q(t) = (v_t + β) / (v_t + tβ) ≥ 1/t. Recall that v_t = (α/(α + β)) (1 − (1 − α − β)^t) is the probability that the arm is good given that it was observed to be bad t steps ago.

PROOF. The Markov chain describing the policy P(t) is shown in Figure 15; it has t + 1 states, which we denote s, 1, 2, 3, ..., t. The state s corresponds to the arm having been observed in state g, and the state j corresponds to the arm having been observed in state b exactly j − 1 steps ago. The transition probability from state j to state j + 1 is 1, from state s to state 1 it is β, from state t to state s it is v_t, and from state t to state 1 it is 1 − v_t. Let π_s, π_1, π_2, ..., π_t denote the steady-state probabilities of being in states s, 1, 2, ..., t, respectively. This Markov chain is easy to solve. We have π_1 = π_2 = ··· = π_t, so that the first identity is π_s + tπ_1 = 1. Furthermore, by considering transitions into and out of s, we obtain βπ_s = v_t π_t = v_t π_1.

FIG. 15. Markov Chain for policy P(t).

Combining these, we obtain π_s = v_t / (v_t + tβ) and π_1 = β / (v_t + tβ). Now we have:

    R(t) = r [(1 − β) π_s + v_t π_1] = r π_s = r v_t / (v_t + tβ)
    Q(t) = π_s + π_t = (v_t + β) / (v_t + tβ).

LEMMA A.2 (LEMMA 2.5). For each arm i, the optimal reward minus penalty of the single-arm policy for arm i is

    H(λ) = max_{t≥1} F(λ, t) = max_{t≥1} ((r − λ) v_t − λβ) / (v_t + tβ).

The maximizer t(λ) = argmax_{t≥1} F(λ, t) satisfies the following:

(1) If λ ≥ rα / (α + β(α + β)), then t(λ) = ∞, and H(λ) = 0.
(2) If λ = rα / (α + β(α + β)) − ρ for some ρ > 0, then t(λ) (and hence H(λ)) can be computed in time polynomial in the input size and in log(1/ρ) by binary search.

PROOF. Since the proof focuses on a single arm i, we omit the subscript for the arm. For notational convenience, denote t* = t(λ). The expression for H(λ) follows easily from Lemma A.1. Recall from Definition 14 that F(λ, t) = R(t) − λQ(t) is the value of policy P(t).

Case 1: λ ≥ rα / (α + β(α + β)). Consider first the subcase λ ≥ r. The function F(λ, t) is maximized by driving the expression (which is always nonpositive) to zero. This happens when t = ∞. Otherwise, when r > λ, observe (using the upper bound on v_t) that

    F(λ, t) = ((r − λ) v_t − λβ) / (v_t + tβ) ≤ ((r − λ) α/(α + β) − λβ) / (v_t + tβ).

This is nonpositive, and it follows again that t = ∞ is the optimum solution.

Case 2: Let λ = rα / (α + β(α + β)) − ρ for some ρ > 0. Rewrite the preceding expression as

    F(λ, t) = (r − λ) − β (λ + t(r − λ)) / (v_t + tβ).
Journal of the ACM, Vol. 58, No. 1, Article 3, Publication date: December 2010.

Page 46: Aristotle University of Thessalonikiusers.auth.gr/leonid/public/books/AppoximationAlgorithmsForRestle… · 3 Approximation Algorithms for Restless Bandit Problems SUDIPTO GUHA University

3:46 S. GUHA ET AL.

Define the following quantities (independent of t):

    ν = 1 − α − β,        η = (α/(α + β)) ln(1/ν),        φ = ηλ + (α/(α + β)) (r − λ),
    μ = η (r − λ),        ω = λβ − α (r − λ)/(α + β) = −ρ (α + β(α + β)) / (α + β).

Observe that r − λ > ρ. Note that φ, μ ≥ 0. By assumption, the value ν ∈ (δ, 1] has polynomial bit complexity. The same holds for η, φ, μ, and ρ. Relaxing t to be a real number, observe:

    ∂F(λ, t)/∂t = (β / (v_t + tβ)^2) ((φ + μt) ν^t + ω).

Since the denominator of ∂F/∂t is always nonnegative, the value of t* is either t* = 1, or the point where the sign of the numerator g(t) = (φ + μt) ν^t + ω changes from + to −. We observe that g(t) has a unique local maximum at t_3 = 1/ln(1/ν) − φ/μ. If g(t_3) is negative, then the numerator of ∂F(λ, t)/∂t is always negative and the optimum solution is t* = 1.

If g(t_3) is positive, then g cannot change sign from + to − in the range [1, t_3), since it has a unique maximum. Therefore, in this range, t = 1, t = ⌊t_3⌋, or t = ⌈t_3⌉ are the candidate optimum solutions.

For t ≥ t_3, since g(t) is decreasing, ∂F/∂t changes sign once from + to − as t increases, and tends to 0 as t → ∞. This behavior is illustrated in Figure 14. Therefore, we find a t_4 > t_3 such that g(t_4) < 0, and perform binary search in the range [t_3, t_4] to find the point where F is maximized. It is easy to compute such a t_4 with bit complexity polynomial in the complexities of ν, η, φ, μ, and ρ. We finally compare this maximal value of F to the values of F at 1, ⌊t_3⌋, and ⌈t_3⌉. Thus, we can compute H(λ) and obtain t* in polynomial time.
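The following Python sketch implements the computation just described for Case 2 only (it assumes λ is strictly below the threshold of part (1)); the names mirror the quantities ν, η, φ, μ, ω above, and the routine is illustrative rather than an optimized implementation.

    import math

    def t_of_lambda(r, alpha, beta, lam, tol=1e-9):
        # assumes Case 2: 0 <= lam < r*alpha / (alpha + beta*(alpha + beta))
        nu  = 1 - alpha - beta
        eta = alpha / (alpha + beta) * math.log(1 / nu)
        phi = eta * lam + alpha / (alpha + beta) * (r - lam)
        mu  = eta * (r - lam)
        omega = lam * beta - alpha * (r - lam) / (alpha + beta)

        def v(t):  return alpha / (alpha + beta) * (1 - nu ** t)
        def F(t):  return ((r - lam) * v(t) - lam * beta) / (v(t) + t * beta)
        def g(t):  return (phi + mu * t) * nu ** t + omega   # sign of dF/dt

        t3 = max(1.0, 1 / math.log(1 / nu) - phi / mu)       # unique maximum of g
        candidates = [1, max(1, math.floor(t3)), math.ceil(t3)]
        if g(t3) > 0:
            t4 = t3 + 1
            while g(t4) >= 0:                                # g is decreasing here
                t4 *= 2
            lo, hi = t3, t4
            while hi - lo > tol:                             # binary search on the sign of g
                mid = (lo + hi) / 2
                lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
            candidates += [max(1, math.floor(lo)), math.ceil(lo)]
        return max(candidates, key=F)

    if __name__ == "__main__":
        r, alpha, beta = 1.0, 0.05, 0.1
        lam_max = r * alpha / (alpha + beta * (alpha + beta))
        print(t_of_lambda(r, alpha, beta, 0.5 * lam_max))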

A.3. PROOF OF THEOREM 2.9. We first show the structure of the optimal so-lution to (WHITTLE). Using the notation from Definition 14, we have: Hi (λ) =Ri (λ) − λQi (λ). Let R(λ) = ∑n

i=1 Ri (λ) and Q(λ) = ∑ni=1 Qi (λ). The following

lemma shows that the optimal solution to (WHITTLE) is obtained by choosing λsuch that Q(λ) ≈ 1.

LEMMA A.3. The optimal solution to Whittle’s LP chooses a penalty λ∗ and afraction a ∈ [0, 1], so that aQ(λ∗

−)+ (1−a)Q(λ∗+) = 1. Here, λ∗

− ≤ λ∗ < λ∗+ with

|λ∗+ − λ∗

−| → 0. The solution corresponds to a convex combination of Pi (ti (λ∗−))

with weight a and Pi (ti (λ∗+)) with weight 1 − a for each arm i.

PROOF. For the optimal solution to (WHITTLE), recall that OPT denote theexpected reward. The expected rate of playing the arms is exactly 1 by the LPconstraint.

When λ = 0, then ti (λ) = 1 for all i , implying Q(λ) = n. Similarly, whenλ = λmax ≥ maxi ri , ti (λ) = ∞ for all i , so that Q(λ) = 0. Therefore, as λ isincreased from 0 to λmax, there is a transition value λ∗ such that Q(λ∗

−) = Q1 ≥ 1,and Q(λ∗

+) = Q2 < 1; furthermore, |λ∗+ − λ∗

−| → 0.Since the solution to (WHITTLE) is feasible for LPLAGRANGE(λ), we must have:

R(λ∗+) − λ∗Q2 ≥ OPT − λ∗ R(λ∗

−) − λ∗Q1 ≥ OPT − λ∗

Journal of the ACM, Vol. 58, No. 1, Article 3, Publication date: December 2010.

Page 47: Aristotle University of Thessalonikiusers.auth.gr/leonid/public/books/AppoximationAlgorithmsForRestle… · 3 Approximation Algorithms for Restless Bandit Problems SUDIPTO GUHA University

Approximation Algorithms for Restless Bandit Problems 3:47

Let a = 1−Q2Q1−Q2

, then taking the convex combination of the above inequalities,we obtain:

a R(λ∗−) + (1 − a)R(λ∗

+) ≥ OPT

aQ(λ∗−) + (1 − a)Q(λ∗

+) = 1.

This completes the proof.

To prove Theorem 2.9, consider n i.i.d. arms with nβ ≪ 1, α = β/(n − 1), and r = 1. Each arm is in state g with stationary probability 1/n, so that all arms are in state b with probability roughly 1/e, and the maximum possible reward of any feasible policy is 1 − 1/e even with complete information about the states of all arms.

We will show that Whittle's LP has value 1 − O(√(nβ)) for nβ ≪ 1. Since the LP is symmetric with respect to the arms, it is easy to show (from Lemma A.3) that for each arm it constructs the same convex combination of two single-arm policies. The first policy is of the form P(t − 1), and the second is of the form P(t). The constraint is that if these policies are executed independently, exactly one arm is played in expectation per step. Since P(t) has lower average reward and rate of play than P(t − 1), we consider the suboptimal LP solution that uses policy P(t) for each arm.

The policy P(t) always plays in state g and, in state b, waits t steps before playing. The value t is chosen so that the rate of play of each arm is less than 1/n, while P(t − 1) has a rate of play larger than 1/n. The rate of play of the single-arm policy P(t) is given by Q(t) = (β + v_t) / (tβ + v_t). Setting this to 1/n, we have v_t = β(t − n)/(n − 1). The reward of each arm is R(t) = v_t / (tβ + v_t) = (t − n) / (n(t − 1)), so the objective of Whittle's LP is n R(t) = 1 − Θ(n/t).

Now, from v_t = β(t − n)/(n − 1), we obtain 1 − (1 − β′)^t = β′(t − n), where β′ = α + β = β n/(n − 1). This holds for t = Θ(√(n/β)) provided nβ ≪ 1. Plugging this value of t into the value n R(t) of Whittle's LP completes the proof of Theorem 2.9.

A.4. PROOF OF LEMMA 3.1. Recall the notation from Appendix A.2 and Definition 14. We first present the following structural lemma about the optimal single-arm policy L_i(λ). Suppose this policy is of the form P_i(t_i(λ)), where t_i(λ) = argmax_{t≥1} F_i(λ, t).

LEMMA A.4. t_i(λ) is monotonically nondecreasing in λ.

PROOF. We have ∂F_i(λ, t)/∂λ = −Q_i(t) = −(v_{it} + β_i)/(v_{it} + tβ_i). Since Q_i(t) is a decreasing function of t, this derivative is increasing in t and always negative, which implies that for smaller t, the function F_i(λ, t) decreases faster as λ is increased. This implies that if t_i(λ) = argmax_{t≥1} F_i(λ, t), then for λ′ ≥ λ, the maximum of F_i(λ′, t) is attained at some t_i(λ′) ≥ t_i(λ).

Now note that when λ = 0, there is no penalty, so the single-arm policy maximizes its reward by playing every step regardless of the state. Therefore, Δ_i(s, t) ≥ 0 for all states (s, t).^6

^6 Note that this is true only for FEEDBACK MAB, where the underlying 2-state process evolves regardless of the plays; the claim need not be true for MONOTONE bandits defined in Section 4, where even with penalty λ = 0, the arm may idle in certain states.

Suppose the arm is in state (g, 1). The immediate expected reward if played is r_i (1 − β_i). If the penalty λ < r_i (1 − β_i), a policy that plays the arm and stops later has positive expected reward minus penalty. Therefore, for such a penalty λ, the optimal decision at state (g, 1) is "play", so that Δ_i(g, 1) ≥ r_i (1 − β_i). We now show that Δ_i(g, 1) = r_i (1 − β_i). Suppose the penalty is λ > r_i (1 − β_i). If the arm is played in state (g, 1), the immediate expected reward minus penalty is negative, and the play leads to the policy being in state (g, 1) or (b, 1). The best possible total reward minus penalty in the future is obtained by always playing in state (g, 1) and waiting as long as possible in state (b, 1) (since this maximizes the chance of reaching state g when played). Whenever the arm is played in state b after w steps, the probability of observing state g is at most α_i/(α_i + β_i). Consider two consecutive events where the policy's last play was in state (g, 1) and the current observed state is (b, 1). Since the optimal such policy is ergodic, this interval defines a renewal period. In this period, the expected penalty is at least λ ((α_i + β_i)/α_i + 1/β_i), and the expected reward is r_i/β_i. Therefore, the net expected reward minus penalty in the renewal period is at most

    (r_i − λ)/β_i − λ (α_i + β_i)/α_i < r_i (1 − (1 − β_i)(1 + β_i/α_i)) = −r_i (β_i/α_i)(1 − β_i − α_i) < 0.

The last inequality follows since α_i + β_i ≤ 1 − δ for a δ > 0 specified as part of the input. This implies that if λ > r_i (1 − β_i), then any policy that plays in state (g, 1) has negative net reward minus penalty, showing that "not playing" is optimal. Therefore, Δ_i(g, 1) = r_i (1 − β_i).

Next, assume that for λ = r_i (1 − β_i) − γ, where γ > 0 is some small number, the policy decision is to "play" in state (b, t). Consider the smallest such t. Since the policy also decides to play in (g, 1), consider the renewal period defined by two consecutive events where the policy's last play was in state (g, 1) and the current observed state is (b, 1). The reward is r_i/β_i and the penalty is λ (1/v_{it} + 1/β_i). Since λ = r_i (1 − β_i) − γ for some very small γ > 0, and v_{it} < α_i/(α_i + β_i), the above analysis shows that the net expected reward minus penalty in the renewal period is negative. Therefore, the decision in (b, t) is "do not play", so that Δ_i(b, t) ≤ r_i (1 − β_i).

Finally, for any λ < r_i (1 − β_i), consider the smallest t ≥ 1 such that the optimal decision in state (b, t) is "play". If this is finite, the optimal policy for this λ is precisely L_i(λ) = P_i(t_i(λ)). From Lemma A.4, the function t_i(λ) is nondecreasing in λ. Therefore, for any state (b, t*), the quantity max{λ | L_i(λ) = P_i(t*)} is well defined. For larger values of the penalty λ, we have t_i(λ) > t*, so that the decision in (b, t*) is "do not play". Therefore, Δ_i(b, t) = max{λ | L_i(λ) = P_i(t)}. Since t_i(λ) is nondecreasing in λ, the function Δ_i(b, t) is nondecreasing in t. This completes the proof of Lemma 3.1.

ACKNOWLEDGMENTS. We thank Shivnath Babu, Jerome Le Ny, Ashish Goel, and Alex Slivkins for discussions concerning parts of this work. We also thank the anonymous reviewers (both of this journal version and of the previous extended abstracts) for several helpful comments.

REFERENCES

ABERNETHY, J., HAZAN, E., AND RAKHLIN, A. 2008. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the Annual Conference on Learning Theory (COLT). 263–274.
ADELMAN, D., AND MERSEREAU, A. J. 2008. Relaxations of weakly coupled stochastic dynamic programs. Oper. Res. 56, 3, 712–727.
AHMAD, S. H. A., LIU, M., JAVIDI, T., ZHAO, Q., AND KRISHNAMACHARI, B. 2008. Optimality of myopic sensing in multi-channel opportunistic access. CoRR abs/0811.0637.
ANSELL, P. S., GLAZEBROOK, K. D., NINO-MORA, J. E., AND O'KEEFFE, M. 2003. Whittle's index policy for a multi-class queueing system with convex holding costs. Math. Meth. Oper. Res. 57, 21–39.
ARROW, K. J., BLACKWELL, D., AND GIRSHICK, M. A. 1949. Bayes and minimax solutions of sequential decision problems. Econometrica 17, 213–244.
ASAWA, M., AND TENEKETZIS, D. 1996. Multi-armed bandits with switching penalties. IEEE Trans. Autom. Control 41, 3, 328–348.
AUDIBERT, J.-Y., AND BUBECK, S. 2009. Minimax policies for adversarial and stochastic bandits. In Proceedings of the Annual Conference on Learning Theory (COLT).
AUER, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3, 397–422.
AUER, P., CESA-BIANCHI, N., AND FISCHER, P. 2002. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 2-3, 235–256.
AUER, P., CESA-BIANCHI, N., FREUND, Y., AND SCHAPIRE, R. E. 2003. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32, 1, 48–77.
BANKS, J. S., AND SUNDARAM, R. K. 1994. Switching costs and the Gittins index. Econometrica 62, 3, 687–694.
BAR-NOY, A., BHATIA, R., NAOR, J., AND SCHIEBER, B. 1998. Minimizing service and operation costs of periodic scheduling. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 11–20.
BERTSEKAS, D. 2001. Dynamic Programming and Optimal Control, 2nd Ed. Athena Scientific.
BERTSIMAS, D., GAMARNIK, D., AND TSITSIKLIS, J. 2002. Performance of multiclass Markovian queueing networks via piecewise linear Lyapunov functions. Ann. Appl. Prob. 11, 4, 1384–1428.
BERTSIMAS, D., AND NINO-MORA, J. 1996. Conservation laws, extended polymatroids and multi-armed bandit problems: A unified polyhedral approach. Math. Oper. Res. 21, 2, 257–306.
BERTSIMAS, D., AND NINO-MORA, J. 2000. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. 48, 1, 80–90.
BREZZI, M., AND LAI, T.-L. 2002. Optimal learning and experimentation in bandit problems. J. Econ. Dynam. Cont. 27, 1, 87–108.
CESA-BIANCHI, N., FREUND, Y., HAUSSLER, D., HELMBOLD, D. P., SCHAPIRE, R. E., AND WARMUTH, M. K. 1997. How to use expert advice. J. ACM 44, 3, 427–485.
DE FARIAS, D. P., AND MEGIDDO, N. 2006. Combining expert advice in reactive environments. J. ACM 53, 5, 762–799.
FAIGLE, U., KERN, W., AND STILL, G. 2002. Algorithmic Principles of Mathematical Programming. Kluwer, Dordrecht.
FLAXMAN, A. D., KALAI, A. T., AND MCMAHAN, H. B. 2005. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'05). SIAM, Philadelphia, PA, 385–394.
GITTINS, J. C., AND JONES, D. M. 1972. A dynamic allocation index for the sequential design of experiments. Progress in Statistics (European Meeting of Statisticians).
GLAZEBROOK, K. D., AND MITCHELL, H. M. 2002. An index policy for a stochastic scheduling model with improving/deteriorating jobs. Naval Res. Log. 49, 706–721.
GLAZEBROOK, K. D., MITCHELL, H. M., AND ANSELL, P. S. 2005. Index policies for the maintenance of a collection of machines by a set of repairmen. Europ. J. Oper. Res. 165, 1, 267–284.
GLAZEBROOK, K. D., RUIZ-HERNANDEZ, D., AND KIRKBRIDE, C. 2006. Some indexable families of restless bandit problems. Adv. Appl. Prob. 38, 643–672.
GOEL, A., GUHA, S., AND MUNAGALA, K. 2006. Asking the right questions: model-driven optimization using probes. In Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'06). ACM, New York, NY, 203–212.
GOSEVA-POPSTOJANOVA, K., AND TRIVEDI, K. S. 2000. Stochastic modeling formalisms for dependability, performance and performability. In Performance Evaluation: Origins and Directions. 403–422.
GUHA, S., AND MUNAGALA, K. 2007a. Approximation algorithms for budgeted learning problems. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC'07). ACM, New York, 104–113.
GUHA, S., AND MUNAGALA, K. 2007b. Approximation algorithms for partial-information based stochastic control with Markovian rewards. In Proceedings of the 48th IEEE Symposium on Foundations of Computer Science (FOCS). 483–493.
GUHA, S., AND MUNAGALA, K. 2007c. Model-driven optimization using adaptive probes. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'07). SIAM, Philadelphia, PA, 308–317.
GUHA, S., MUNAGALA, K., AND SARKAR, S. 2008. Information acquisition and exploitation in multichannel wireless networks. CoRR abs/0804.1724.
GUHA, S., MUNAGALA, K., AND SHI, P. 2009. Approximation algorithms for restless bandit problems. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'09). SIAM, Philadelphia, PA, 28–37.
HAWKINS, J. T. 2003. A Lagrangian decomposition approach to weakly coupled dynamic optimization problems and its applications. Ph.D. dissertation, Operations Research Center, Massachusetts Institute of Technology.
KAELBLING, L. P., LITTMAN, M. L., AND CASSANDRA, A. R. 1998. Planning and acting in partially observable stochastic domains. Artif. Intell. J. 101, 99–134.
KAKADE, S. M., AND KEARNS, M. J. 2005. Trading in Markovian price models. In Proceedings of the Annual Conference on Learning Theory (COLT). 606–620.
KODIALAM, M. S., AND LAKSHMAN, T. V. 2007. Achievable rate region for wireless systems with time varying channels. In Proceedings of INFOCOM. 53–61.
LAI, T. L., AND ROBBINS, H. 1985. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22.
LITTLESTONE, N., AND WARMUTH, M. K. 1994. The weighted majority algorithm. Inform. Comput. 108, 2, 212–261.
LIU, K., AND ZHAO, Q. 2008. A restless bandit formulation of multi-channel opportunistic access: Indexability and index policy. Computing Research Repository, arXiv:0810.4658.
LUND, C., AND YANNAKAKIS, M. 1993. The approximation of maximum subgraph problems. In Proceedings of the 20th International Colloquium on Automata, Languages and Programming (ICALP). 40–51.
MUNAGALA, K., AND SHI, P. 2008. The stochastic machine replenishment problem. In Proceedings of the 13th International Conference on Integer Programming and Combinatorial Optimization (IPCO'08). Springer-Verlag, Berlin, 169–183.
NY, J. L., DAHLEH, M., AND FERON, E. 2008. Multi-UAV dynamic routing with partial observations using restless bandits allocation indices. In Proceedings of the American Control Conference.
PAPADIMITRIOU, C. H., AND TSITSIKLIS, J. N. 1999. The complexity of optimal queueing network control. Math. Oper. Res. 24, 2, 293–305.
ROBBINS, H. 1952. Some aspects of the sequential design of experiments. Bull. AMS 55, 527–535.
SLIVKINS, A., AND UPFAL, E. 2008. Adapting to a changing environment: The Brownian restless bandits. In Proceedings of the Annual Conference on Learning Theory (COLT). 343–354.
SMALLWOOD, R., AND SONDIK, E. 1971. The optimal control of partially observable Markov processes over a finite horizon. Oper. Res. 21, 1071–1088.
SONDIK, E. J. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Oper. Res. 26, 2, 282–304.
TSITSIKLIS, J. N. 1994. A short proof of the Gittins index theorem. Ann. Appl. Prob. 4, 1, 194–199.
WALD, A. 1947. Sequential Analysis. Wiley, New York.
WEBER, R. R., AND WEISS, G. 1990. On an index policy for restless bandits. J. Appl. Prob. 27, 3, 637–648.
WEISS, G. 1988. Branching bandit processes. Probab. Engng. Inform. Sci. 2, 269–278.
WHITTLE, P. 1981. Arm acquiring bandits. Ann. Probab. 9, 284–292.
WHITTLE, P. 1988. Restless bandits: Activity allocation in a changing world. J. Appl. Prob. 25A, 287–298.
ZHAO, Q., KRISHNAMACHARI, B., AND LIU, K. 2007. On myopic sensing for multi-channel opportunistic access. CoRR abs/0712.0035.
ZINKEVICH, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML). 928–936.

RECEIVED FEBRUARY 2009; REVISED AUGUST 2010; ACCEPTED SEPTEMBER 2010
