
Experiments

Experiment setting

• MatConvNet toolbox, i7-4790K CPU, Nvidia Titan X GPU.

• Trained on VOT dataset & evaluated on OTB-100 dataset.

• ADNet (N_I: 3000, N_O: 250, I: 10) → 3 fps (Prec: 88%)

• ADNet-fast (N_I: 300, N_O: 50, I: 30) → 15 fps (Prec: 85%)
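For reference, these two settings can be collected into a small configuration table. The sketch below is illustrative Python; the dictionary name and keys are mine, not from the released code.

# Hyperparameters of the two reported variants (names are illustrative, not the authors' code).
# N_I: initial fine-tuning samples, N_O: online samples, I: online-adaptation interval (frames).
ADNET_CONFIGS = {
    "ADNet":      {"N_I": 3000, "N_O": 250, "I": 10},  # ~3 fps, 88% precision on OTB-100
    "ADNet-fast": {"N_I": 300,  "N_O": 50,  "I": 30},  # ~15 fps, 85% precision on OTB-100
}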

Analysis of actions

• 93% of all frames are handled with fewer than 5 actions.

Self-comparison

• init: no pre-training, only online adaptation

• SL: supervised learning

• SS: semi-supervised, uses only 1/10 of the ground-truth annotations

• RL: reinforcement learning

OTB-100 test results

Sequential actions selected by ADNet

Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning

Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, and Jin Young Choi

Motivation

Visual tracking

• Find the target position in a new frame.

• Deep CNN-based tracking method (Tracking-by-detection)

Problem

• Inefficient search strategy.

• Requires a large number of labeled video frames to train CNNs.

Action-Decision Networks (ADNet)

Problem setting (Markov decision process)

• Tracking is formulated as a Markov decision process; the state, action, and reward are defined under Action-driven tracking below.

Action-Decision Network

[Figure: ADNet architecture applied over a tracking sequence (frames F_{l−1}, F_l, F_{l+1}), with the state transition from (p_t, d_t) to (p_{t+1}, d_{t+1}). The 112 × 112 × 3 image patch passes through conv1 (51 × 51 × 96), conv2 (11 × 11 × 256), and conv3 (3 × 3 × 512), then fc4 (512) and fc5 (512, concatenated with the 110-d action dynamics); fc6 outputs the 11 actions and fc7 outputs the 2 confidence scores. Actions are selected and applied until the 'stop' action.]
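The diagram can be read as the following rough PyTorch sketch (my reconstruction, not the released MatConvNet code). The conv kernel sizes and strides are assumed to follow VGG-M, and exactly where the 110-d action dynamics vector is concatenated is inferred from the "512 + 110" label.

import torch
import torch.nn as nn

class ADNetSketch(nn.Module):
    # Rough reconstruction of the poster's diagram. Conv kernel sizes/strides follow
    # VGG-M (an assumption); the poster only gives the feature-map and layer sizes.
    def __init__(self, num_actions=11, dynamics_dim=110):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),    # ~51 x 51 x 96
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),  # 11 x 11 x 256
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True), # 3 x 3 x 512
            nn.AdaptiveMaxPool2d((3, 3)),  # guard: force a 3x3 spatial output
        )
        self.fc4 = nn.Sequential(nn.Linear(3 * 3 * 512, 512), nn.ReLU(inplace=True))
        self.fc5 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True))
        # Assumed reading of the "512 + 110" label: the action dynamics vector d_t is
        # concatenated to the fc5 feature before the two output heads.
        self.fc6_action = nn.Linear(512 + dynamics_dim, num_actions)  # 11 actions (incl. 'stop')
        self.fc7_confidence = nn.Linear(512 + dynamics_dim, 2)        # target / background

    def forward(self, patch, dynamics):
        # patch: (B, 3, 112, 112) image patch p_t; dynamics: (B, 110) action dynamics d_t
        x = self.convs(patch).flatten(1)
        x = self.fc5(self.fc4(x))
        x = torch.cat([x, dynamics], dim=1)
        return self.fc6_action(x), self.fc7_confidence(x)

At tracking time, the chosen action re-positions the patch and the network is evaluated again, repeating until the 'stop' action is selected, as in the loop shown in the diagram.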

Approach

Action-driven tracking

• Dynamically capture the target by selecting sequential actions.

• Comparison with the existing CNN-based tracker [1] (see the figure below).

• Action: a_t, one of 11 discrete actions (translation moves, scale changes, and stop).

• State: s_t = (p_t, d_t)
  Image patch: p_t ∈ ℝ^(112×112×3)
  Action dynamics: d_t ∈ ℝ^110

• Reward:
  r(s_t) = +1 if IoU(p_t, G) > 0.7, and −1 otherwise
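The reward only requires an overlap test against the ground-truth box G. A minimal sketch, assuming boxes are given as (x, y, w, h) tuples; the helper names are mine, not from the paper.

def iou(box_a, box_b):
    # Intersection-over-union of two boxes in (x, y, w, h) format.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def reward(patch_box, gt_box):
    # r(s_t) = +1 if IoU(p_t, G) > 0.7, and -1 otherwise.
    return 1.0 if iou(patch_box, gt_box) > 0.7 else -1.0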

[Figure: comparison with an existing CNN-based tracker [1]. The existing tracker draws candidates around the frame (t−1) target, classifies each as target or background with a CNN model, and chooses the best candidate in frame t; our method instead moves the box with sequential actions. Conv layers are initialized from a pre-trained VGG-M model.]

Training method

• Reinforcement learning

Can handle the semi-supervised case.

Run tracking on pieces of the training sequences.

Get the trajectory of states, actions, and rewards {s_t, a_t, r_t}.

With the REINFORCE algorithm, train the policy network to maximize the expected tracking rewards.

• Supervised learning

Generate training samples of state-action pairs from the training videos.

Train the policy network by multiclass classification with a softmax loss.

Then, reinforcement learning is performed to improve ADNet.
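A minimal sketch of this supervised stage, assuming the ADNetSketch model above and an optimizer over its parameters; the sampling of patches and the loss on the confidence head are omitted, and the names are illustrative.

import torch.nn.functional as F

def supervised_step(model, optimizer, patches, dynamics, action_labels):
    # One step of the supervised stage: multiclass (softmax) classification of the
    # action label derived from the ground truth, for a batch of sampled states.
    action_logits, _conf_logits = model(patches, dynamics)
    loss = F.cross_entropy(action_logits, action_labels)  # softmax loss over the 11 actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()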

[Figure: RL training simulation on frames #160, #190, and #220, with rewards of +1 and −1 depending on tracking success.]

REINFORCE policy-gradient update:

ΔW ∝ Σ_t^T [∂ log p(a_t | s_t; W) / ∂W] · r_t
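In an autograd framework, this update corresponds to descending on the negative reward-weighted log-likelihood of the actions taken during the tracking simulation. A minimal sketch, reusing the model sketched earlier; the trajectory format and names are mine.

import torch
import torch.nn.functional as F

def reinforce_step(model, optimizer, trajectory):
    # REINFORCE: maximize sum_t r_t * log p(a_t | s_t; W) over one tracked sequence.
    # `trajectory` is a list of (patch, dynamics, action_index, reward) tuples.
    terms = []
    for patch, dynamics, action, r in trajectory:
        action_logits, _ = model(patch.unsqueeze(0), dynamics.unsqueeze(0))
        log_prob = F.log_softmax(action_logits, dim=1)[0, action]
        terms.append(-r * log_prob)  # minimizing this ascends the expected reward
    loss = torch.stack(terms).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()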

Selected action histogram

[1] H. Nam and B. Han. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. CVPR 2016.

(Figure labels: 256 samples; average of 5 actions)

Online Tracking

Initial fine-tuning with N_I initial samples.

Online adaptation every I frames with N_O online samples.

Re-detection is conducted when the target is missed (see the sketch below).
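A minimal sketch of this online procedure, reusing the hyperparameter names N_I, N_O, and I. The helper callables (patch cropping, applying an action to the box, fine-tuning, re-detection) and the confidence threshold are assumptions supplied by the caller, not the authors' exact procedure.

import torch

def track_sequence(model, frames, init_box, crop_patch, apply_action, finetune, redetect,
                   n_init=3000, n_online=250, interval=10, stop_action=10,
                   conf_threshold=0.5, max_actions=20):
    # Greedy action-driven tracking. crop_patch/apply_action/finetune/redetect are
    # caller-supplied (hypothetical) helpers; defaults follow the reported ADNet setting.
    finetune(model, frames[0], init_box, n_init)       # initial fine-tuning with N_I samples
    box, dynamics, boxes = init_box, torch.zeros(110), [init_box]
    for t, frame in enumerate(frames[1:], start=1):
        conf = 1.0
        for _ in range(max_actions):                   # select and apply actions until 'stop'
            patch = crop_patch(frame, box)             # p_t: (3, 112, 112) tensor
            action_logits, conf_logits = model(patch.unsqueeze(0), dynamics.unsqueeze(0))
            conf = torch.softmax(conf_logits, dim=1)[0, 1].item()
            action = action_logits.argmax(dim=1).item()
            if action == stop_action:
                break
            box, dynamics = apply_action(box, dynamics, action)
        if conf < conf_threshold:                      # target missed -> re-detection
            box = redetect(model, frame, box)
        if t % interval == 0:                          # online adaptation every I frames
            finetune(model, frame, box, n_online)      # with N_O online samples
        boxes.append(box)
    return boxes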

Trajectory of state-action pairs: (s_0, a_0, s_1, a_1, …, s_t, a_t), (s_{t+1}, a_{t+1}, …, s_l, a_l)