Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking
Presented by Shihao Ji
Duke University Machine Learning Group
June 10, 2005
Authors: Vikram Krishnamurthy & Robin Evans
• Motivation
• Overview
• Multiarmed Bandits
• HMM Multiarmed Bandits
• Experimental Results
Outline
• The ESA (electronically scanned array) has only one steerable beam.
• The coordinates of each target evolve according to a finite-state Markov chain.
• Question: which single target should the tracker choose to observe at each time instant in order to optimize a specified cost function?
Motivation
Overview - How does it work?
• The Model
One has N parallel projects, indexed i = 1, 2, …, N, and at each instant of discrete time one can work on only a single project. Let the state of project i at time k be denoted s_k^{(i)}. If one works on project i at time k, then one pays an immediate expected cost R(s_k^{(i)}, i). The state changes to s_{k+1}^{(i)} by a Markov transition rule (which may depend on i, but not on k), while the states of the projects one has not touched remain unchanged: s_{k+1}^{(j)} = s_k^{(j)} for j ≠ i. The problem is how to allocate one's effort over projects sequentially in time so as to minimize the expected total discounted cost.
Multiarmed Bandits
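The model above can be sketched in a few lines of Python. Everything here is illustrative: the transition matrices and costs are random toy values, and the round-robin policy is a placeholder, not the optimal index policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the model: N projects, each a 3-state Markov chain.
# Working on project i advances only that chain and incurs a cost R(s, i).
N, S = 4, 3
P_trans = [rng.dirichlet(np.ones(S), size=S) for _ in range(N)]  # per-project transition matrices
cost = rng.uniform(1.0, 5.0, size=(N, S))                        # immediate cost R(s, i)

states = np.zeros(N, dtype=int)   # s_0^{(i)} = 0 for all projects
beta, total = 0.9, 0.0            # discount factor, accumulated discounted cost

for k in range(50):
    i = k % N                              # placeholder policy (round robin)
    total += beta**k * cost[i, states[i]]  # pay immediate cost R(s_k^{(i)}, i)
    # only the touched project's state moves; all other states stay frozen
    states[i] = rng.choice(S, p=P_trans[i][states[i]])

print(round(total, 3))
```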
Gittins Index
• Simplest non-trivial problem, classic
• No essential progress was made until the work of Gittins and his co-workers.
• They proved that to each project i one can attach an index ν_i(s_k^{(i)}) such that the optimal action at time k is to work on the project whose current index is smallest. The index is calculated by solving the problem of allocating one's effort optimally between project i and a standard project that yields a constant cost.
• Gittins' result thus reduces the case of general N to the case N = 2.
HMM Multiarmed Bandits
• The “standard” multiarmed bandit problem involves a fully observed finite-state Markov chain and is simply an MDP with a rich structure.
• In multitarget tracking, due to measurement noise at the sensor, the states are only partially observable. Thus the multitarget tracking problem must be formulated as a multiarmed bandit problem involving HMMs (with an HMM filter to estimate the information state).
• It can be solved by brute force as a POMDP, but that involves a Markov chain of much higher (enormous) dimension.
• The bandit assumption decouples the problem.
• The information state of the currently observed target p is updated by the HMM filter:

  x_{k+1}^{(p)} = B^{(p)}(y_{k+1}^{(p)}) A^{(p)'} x_k^{(p)} / ( 1' B^{(p)}(y_{k+1}^{(p)}) A^{(p)'} x_k^{(p)} )

• For the other P−1 unobserved targets, the information states are kept frozen:

  x_{k+1}^{(q)} = x_k^{(q)},  if target q is not observed

Bandit Assumption
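A minimal numeric sketch of these two update rules, for an illustrative 2-state target (the matrices A and B below are made-up values, not from the paper):

```python
import numpy as np

# A: transition matrix of the observed target's Markov chain.
# B: observation likelihoods, row y holds P(y | state).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.3, 0.7]])

def hmm_filter_step(x, y):
    """One HMM filter update: x' = B(y) A' x / (1' B(y) A' x)."""
    unnorm = np.diag(B[y]) @ A.T @ x
    return unnorm / unnorm.sum()

x_p = np.array([0.5, 0.5])       # information state of observed target p
x_q = np.array([0.6, 0.4])       # information state of an unobserved target q

x_p = hmm_filter_step(x_p, y=0)  # the observed target is filtered
# x_q is left unchanged (frozen), per the bandit assumption
print(x_p, x_q)
```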
Why is it Valid?
• Slow Dynamics: slowly moving targets have an (approximate) bandit structure:
  x_{k+1}^{(q)} = A^{(q)'} x_k^{(q)} = x_k^{(q)} + O(ε),  where A^{(p)} = I + ε Q^{(p)}, ε → 0
• Decoupling Approximation: without the bandit assumption, the optimal solution is intractable. The bandit model is perhaps the only reasonable approximation that leads to a computationally tractable solution.
• Reinitialization: a compromise. Reinitialize the HMM multiarmed bandits at regular intervals with updated estimates from all targets.
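The slow-dynamics claim is easy to check numerically; the generator Q and step size ε below are illustrative values:

```python
import numpy as np

# With A = I + eps*Q (Q a generator whose rows sum to zero), one prediction
# step moves the information state only by O(eps), so freezing the
# unobserved targets' states introduces only a small error.
eps = 1e-3
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])
A = np.eye(2) + eps * Q

x = np.array([0.3, 0.7])
drift = np.linalg.norm(A.T @ x - x)   # ||x_{k+1} - x_k|| = O(eps)
print(drift)
```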
Some Details
• Finite-State Markov Assumption: s_k^{(p)} ∈ {d_1, …, d_S} denotes the quantized distance of the pth target from the base station, and the target distance evolves according to a finite-state Markov chain.
• Cost structure: R(s_k^{(p)}, p) typically depends on the distance of the pth target to the base station; i.e., a target close to the base station poses a greater threat and is given higher priority by the tracking algorithm.
• Objective function: minimize the expected total discounted cost

  J = E{ Σ_{k=0}^∞ β^k R(x_k^{(u_k)}, u_k) }

  where u_k is the target observed at time k and β ∈ [0, 1) is the discount factor.
Optimal Solution
• Under the bandit assumption, the optimal solution has an indexable (decoupling) rule; that is, the optimization decouples into P independent optimization problems.
• For each target p, there is a function (the Gittins index) γ^{(p)}(x_k^{(p)}), computed by POMDP algorithms; see the next slide.
• The optimal scheduling policy at time k is to steer the beam toward the target with the smallest Gittins index:

  q_k = argmin_{p=1,…,P} γ^{(p)}(x_k^{(p)})
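The index rule itself is a one-liner once the per-target indices are available. In this sketch, `gittins` is a hypothetical placeholder for the real index function, which would come from the POMDP computation on the next slide; the cost values are illustrative.

```python
import numpy as np

def gittins(p, x):
    # Placeholder index: expected per-state cost of target p under belief x.
    # The true Gittins index is the value of the return-to-state problem.
    costs = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 2.5]])  # illustrative
    return float(costs[p] @ x)

# Current information states x_k^{(p)} of P = 3 targets (illustrative).
info_states = [np.array([0.8, 0.2]),
               np.array([0.1, 0.9]),
               np.array([0.5, 0.5])]

# q_k = argmin_p gamma^{(p)}(x_k^{(p)}): steer the beam to the smallest index.
q = min(range(len(info_states)), key=lambda p: gittins(p, info_states[p]))
print(q)
```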
Gittins Index
• For an arbitrary multiarmed bandit problem, the Gittins index can be calculated by solving an associated infinite-horizon discounted control problem called the “return to state” problem.
• For target p, given the information state x_k^{(p)} at time k, there are two actions:
1) Continue, which incurs a cost R(x_k^{(p)}, p), and x_{k+1}^{(p)} evolves according to the HMM filter;
2) Restart, which moves to a fixed information state x^{(p)}, incurs a cost R(x^{(p)}, p), and then x_{k+1}^{(p)} evolves according to the HMM filter.
• The Gittins index of the state x^{(p)} of target p is given by

  γ^{(p)}(x^{(p)}) = V^{(p)}(x^{(p)}, x^{(p)})

  where V^{(p)}(x, x^{(p)}) satisfies the Bellman equation of the return-to-state problem.
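For the fully observed (finite-state) special case, the return-to-state index can be computed by plain value iteration; the chain and costs below are illustrative. The HMM case replaces states with information states and needs a POMDP solver, as the next slide describes.

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # transition matrix of one project (illustrative)
c = np.array([1.0, 3.0])     # per-state cost R(s) (illustrative)
beta = 0.9                   # discount factor

def gittins_index(r, iters=2000):
    """Index of restart state r: fixed point of the return-to-state Bellman
    equation V(s) = min( c[s] + beta*P[s]@V,  c[r] + beta*P[r]@V )."""
    V = np.zeros(len(c))
    for _ in range(iters):
        cont = c + beta * (P @ V)   # continue in the current state s
        rest = cont[r]              # restart: jump to state r, then continue
        V = np.minimum(cont, rest)
    return V[r]                     # gamma(r) = V(r, r)

idx = [gittins_index(r) for r in range(2)]
print(idx)
```

Cheaper states get smaller indices, so the scheduling rule argmin γ prefers them.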
POMDP solver
• Defining new parameters (see eq. 15), the Bellman equation for V^{(p)}(x, x^{(p)}) can be rewritten as a standard POMDP value iteration V_k^{(p)}(x, x^{(p)}) → V^{(p)}(x, x^{(p)}).
• It can then be solved by any standard POMDP solver, such as Sondik's algorithm, the Witness algorithm, Incremental Pruning, or approximate (suboptimal) algorithms.
Experimental Results