Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking
Presented by Shihao Ji
Duke University Machine Learning Group
June 10, 2005
Authors: Vikram Krishnamurthy & Robin Evans
• Motivation
• Overview
• Multiarmed Bandits
• HMM Multiarmed Bandits
• Experimental Results
Outline
• The ESA (electronically scanned array) has only one steerable beam.
• The coordinates of each target evolve according to a finite-state Markov chain.
• Question: which single target should the tracker choose to observe at each time instant in order to optimize a specified cost function?
Motivation
Overview - How does it work?
• The Model
One has N parallel projects, indexed i = 1, 2, …, N, and at each instant of discrete time one can work on only a single project. Let the state of project i at time k be denoted s_k^{(i)}. If one works on project i at time k, then one pays an immediate expected cost R(s_k^{(i)}, i). The state changes to s_{k+1}^{(i)} by a Markov transition rule (which may depend on i, but not on k), while the states of the projects one has not touched remain unchanged: s_{k+1}^{(j)} = s_k^{(j)} for j ≠ i. The problem is how to allocate one's effort over projects sequentially in time so as to minimize the expected total discounted cost.
Multiarmed Bandits
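The model above can be sketched in a few lines of Python. Everything here is illustrative: the transition matrices and costs are random toy values, and the round-robin policy is a placeholder, not the optimal index policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the model: N projects, each a 3-state Markov chain.
# Working on project i advances only that chain and incurs a cost R(s, i).
N, S = 4, 3
P_trans = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]  # per-project transition matrices
cost = rng.uniform(1.0, 5.0, size=(N, S))                        # immediate cost R(s, i)

states = np.zeros(N, dtype=int)   # s_0^{(i)} = 0 for all projects
beta, total = 0.9, 0.0            # discount factor, accumulated discounted cost

for k in range(50):
    i = k % N                              # placeholder policy (round robin)
    total += beta**k * cost[i, states[i]]  # pay immediate cost R(s_k^{(i)}, i)
    # only the touched project's state moves; all other states stay frozen
    states[i] = rng.choice(S, p=P_trans[i][states[i]])

print(round(total, 3))
```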
Gittins Index
• Simplest non-trivial problem, classic
• No essential progress was made until the work of Gittins and his co-workers.
• They proved that to each project i one can attach an index ν_i(s_k^{(i)}) such that the optimal action at time k is to work on the project whose current index is smallest. The index is calculated by solving the problem of allocating one's effort optimally between project i and a standard project that yields a constant cost.
• Gittins' result thus reduces the case of general N to the case N = 2.
HMM Multiarmed Bandits
• The “standard” multiarmed bandit problem involves a fully observed finite-state Markov chain and is simply an MDP with a rich structure.
• In multitarget tracking, due to measurement noise at the sensor, the states are only partially observable. Thus the multitarget tracking problem must be formulated as a multiarmed bandit problem involving HMMs (with an HMM filter to estimate the information state).
• It can be solved by brute force as a POMDP, but that involves a Markov chain of much higher (enormous) dimension.
• The bandit assumption decouples the problem.
• The information state of the currently observed target p is updated by the HMM filter:

  x_{k+1}^{(p)} = B^{(p)}(y_{k+1}^{(p)}) A^{(p)'} x_k^{(p)} / ( 1' B^{(p)}(y_{k+1}^{(p)}) A^{(p)'} x_k^{(p)} )

• For the other P−1 unobserved targets, the information states are kept frozen:

  x_{k+1}^{(q)} = x_k^{(q)},  if target q is not observed

Bandit Assumption
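A minimal numeric sketch of these two update rules, for an illustrative 2-state target (the matrices A and B below are made-up values, not from the paper):

```python
import numpy as np

# A: transition matrix of the observed target's Markov chain.
# B: observation likelihoods, row y holds P(y | state).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.3, 0.7]])

def hmm_filter_step(x, y):
    """One HMM filter update: x' = B(y) A' x / (1' B(y) A' x)."""
    unnorm = np.diag(B[y]) @ A.T @ x
    return unnorm / unnorm.sum()

x_p = np.array([0.5, 0.5])       # information state of observed target p
x_q = np.array([0.6, 0.4])       # information state of an unobserved target q

x_p = hmm_filter_step(x_p, y=0)  # the observed target is filtered
# x_q is left unchanged (frozen), per the bandit assumption
print(x_p, x_q)
```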
Why is it Valid?
• Slow Dynamics: slowly moving targets have an (approximate) bandit structure:
  x_{k+1}^{(q)} = A^{(q)'} x_k^{(q)} = x_k^{(q)} + O(ε),  where A^{(p)} = I + ε Q^{(p)}, ε → 0
• Decoupling Approximation: without the bandit assumption, the optimal solution is intractable. The bandit model is perhaps the only reasonable approximation that leads to a computationally tractable solution.
• Reinitialization: a compromise. Reinitialize the HMM multiarmed bandits at regular intervals with updated estimates from all targets.
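The slow-dynamics claim is easy to check numerically; the generator Q and step size ε below are illustrative values:

```python
import numpy as np

# With A = I + eps*Q (Q a generator whose rows sum to zero), one prediction
# step moves the information state only by O(eps), so freezing the
# unobserved targets' states introduces only a small error.
eps = 1e-3
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])
A = np.eye(2) + eps * Q

x = np.array([0.3, 0.7])
drift = np.linalg.norm(A.T @ x - x)   # ||x_{k+1} - x_k|| = O(eps)
print(drift)
```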
Some Details
• Finite-State Markov Assumption: s_k^{(p)} ∈ {d_1, …, d_S} denotes the quantized distance of the pth target from the base station, and the target distance evolves according to a finite-state Markov chain.
• Cost structure: R(s_k^{(p)}, p) typically depends on the distance of the pth target to the base station; i.e., a target close to the base station poses a greater threat and is given higher priority by the tracking algorithm.
• Objective function: minimize the expected total discounted cost

  J = E{ Σ_{k=0}^∞ β^k R(x_k^{(u_k)}, u_k) }

  where u_k is the target observed at time k and β ∈ [0, 1) is the discount factor.
Optimal Solution
• Under the bandit assumption, the optimal solution has an indexable (decoupling) rule; that is, the optimization decouples into P independent optimization problems.
• For each target p, there is a function (the Gittins index) γ^{(p)}(x_k^{(p)}), computed by POMDP algorithms; see the next slide.
• The optimal scheduling policy at time k is to steer the beam toward the target with the smallest Gittins index:

  q_k = argmin_{p=1,…,P} γ^{(p)}(x_k^{(p)})
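The index rule itself is a one-liner once the per-target indices are available. In this sketch, `gittins` is a hypothetical placeholder for the real index function, which would come from the POMDP computation on the next slide; the cost values are illustrative.

```python
import numpy as np

def gittins(p, x):
    # Placeholder index: expected per-state cost of target p under belief x.
    # The true Gittins index is the value of the return-to-state problem.
    costs = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 2.5]])  # illustrative
    return float(costs[p] @ x)

# Current information states x_k^{(p)} of P = 3 targets (illustrative).
info_states = [np.array([0.8, 0.2]),
               np.array([0.1, 0.9]),
               np.array([0.5, 0.5])]

# q_k = argmin_p gamma^{(p)}(x_k^{(p)}): steer the beam to the smallest index.
q = min(range(len(info_states)), key=lambda p: gittins(p, info_states[p]))
print(q)
```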
Gittins Index
• For an arbitrary multiarmed bandit problem, the Gittins index can be calculated by solving an associated infinite-horizon discounted control problem called the “return to state” problem.
• For target p, given the information state x_k^{(p)} at time k, there are two actions:
1) Continue, which incurs a cost R(x_k^{(p)}, p), and x_{k+1}^{(p)} evolves according to the HMM filter;
2) Restart, which moves to a fixed information state x^{(p)}, incurs a cost R(x^{(p)}, p), and then x_{k+1}^{(p)} evolves according to the HMM filter.
• The Gittins index of the state x^{(p)} of target p is given by

  γ^{(p)}(x^{(p)}) = V^{(p)}(x^{(p)}, x^{(p)})

  where V^{(p)}(x, x^{(p)}) satisfies the Bellman equation of the return-to-state problem.
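For the fully observed (finite-state) special case, the return-to-state index can be computed by plain value iteration; the chain and costs below are illustrative. The HMM case replaces states with information states and needs a POMDP solver, as the next slide describes.

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # transition matrix of one project (illustrative)
c = np.array([1.0, 3.0])     # per-state cost R(s) (illustrative)
beta = 0.9                   # discount factor

def gittins_index(r, iters=2000):
    """Index of restart state r: fixed point of the return-to-state Bellman
    equation V(s) = min( c[s] + beta*P[s]@V,  c[r] + beta*P[r]@V )."""
    V = np.zeros(len(c))
    for _ in range(iters):
        cont = c + beta * (P @ V)   # continue in the current state s
        rest = cont[r]              # restart: jump to state r, then continue
        V = np.minimum(cont, rest)
    return V[r]                     # gamma(r) = V(r, r)

idx = [gittins_index(r) for r in range(2)]
print(idx)
```

Cheaper states get smaller indices, so the scheduling rule argmin γ prefers them.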
POMDP solver
• Defining new parameters (see eq. 15), the Bellman equation for V^{(p)}(x, x^{(p)}) can be rewritten as a standard POMDP value iteration V_k^{(p)}(x, x^{(p)}) → V^{(p)}(x, x^{(p)}).
• It can then be solved by any standard POMDP solver, such as Sondik's algorithm, the Witness algorithm, Incremental Pruning, or approximate (suboptimal) algorithms.
Experimental Results