Human Action Recognition without Human
He Yun1,2, Soma Shirakabe1,2, Yutaka Satoh1,2, Hirokatsu Kataoka1
1Computer Vision Research Group, AIST, Japan 2Human-Centered Vision Lab., University of Tsukuba, Japan
Motion representation
• Database: UCF101, HMDB51, ActivityNet
• Approach: IDT, Two-Stream CNN
– Databases and approaches are already well established in the field
Action Database
http://www.thumos.info/
The problem setting in action recognition
• Video-level prediction
– 1 action-label prediction per input video
[Figure: a video passed through a Motion Descriptor, predicted label "TennisSwing"]
Dense Trajectories (DT) [Wang+, CVPR11]
• Trajectory-based representation
– A large amount of trajectories
– Feature description (HOG, HOF, MBH)
– Codeword vector is generated
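The codeword step can be illustrated with a small sketch. It assumes a plain bag-of-features encoding: each local descriptor (for example HOG, HOF or MBH along a trajectory) is assigned to its nearest entry in a pre-learned codebook, and the assignments are counted into a normalized histogram. The function name `encode_bag_of_features` and the toy data are hypothetical, and the slides do not state which encoding was actually used.

```python
import numpy as np


def encode_bag_of_features(descriptors, codebook):
    """Encode local descriptors as a normalized codeword histogram.

    descriptors: array of shape (N, D), one row per trajectory descriptor.
    codebook:    array of shape (K, D), one row per codeword (assumed pre-learned).
    """
    # Squared Euclidean distance between every descriptor and every codeword: (N, K)
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)

    # Index of the nearest codeword for each descriptor
    nearest = dists.argmin(axis=1)

    # Count assignments per codeword and normalize to sum to 1
    hist = np.bincount(nearest, minlength=len(codebook)).astype(np.float64)
    total = hist.sum()
    return hist / total if total > 0 else hist


# Toy example: two codewords, four 2-D descriptors
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [0.5, 0.5]])
video_vector = encode_bag_of_features(descriptors, codebook)
```

The resulting vector has one entry per codeword, so videos with different numbers of trajectories map to a fixed-length input for a classifier.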
Two-Stream CNN [Simonyan+, NIPS14]
• Spatial and temporal convolution
– Spatial-stream: From an RGB image
– Temporal-stream: From stacked optical flows
– Score fusion: Average or SVM
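The averaging variant of score fusion can be sketched as follows. This is a minimal illustration, assuming each stream already outputs one score per class for a video; the function names and the example scores are hypothetical, and the SVM variant is not shown.

```python
def fuse_scores(spatial_scores, temporal_scores):
    """Average the per-class scores of the spatial and temporal streams."""
    if len(spatial_scores) != len(temporal_scores):
        raise ValueError("both streams must score the same set of classes")
    return [(s + t) / 2.0 for s, t in zip(spatial_scores, temporal_scores)]


def predict_label(spatial_scores, temporal_scores, class_names):
    """Return the single video-level label with the highest fused score."""
    fused = fuse_scores(spatial_scores, temporal_scores)
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best]


# Hypothetical scores for three UCF101 classes
class_names = ["TennisSwing", "Basketball", "Biking"]
spatial = [0.5, 0.3, 0.2]   # from the RGB stream
temporal = [0.7, 0.1, 0.2]  # from the stacked-flow stream
label = predict_label(spatial, temporal, class_names)
```

Averaging treats both streams as equally reliable; a learned fusion such as an SVM on the concatenated scores can weight them differently.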
Is background enough to classify actions?
• RGB input is too strong!
– The two-stream CNN [Simonyan+, NIPS14] reported that the spatial-stream alone recognizes actions better than expected
• 72.4% with spatial-stream (RGB) @UCF101
• “Human Action Recognition without Human”
Without Human?
• Can human action recognition be done just from the motion of the background?
[Figure: Motion Descriptor output "TennisSwing" for the original video versus "TennisSwing?" for the video without the human]
Detailed setting of w/ and w/o Human
• With and without human setting
– Without human setting: center-blind image with UCF101
– With human setting: inverse of the without human setting
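[Figure: masking operation I’(x,y) = I(x,y) * f(x,y). The filter f(x,y) divides each image axis into 1/4, 1/2, 1/4. The Without Human Setting blanks the central 1/2 x 1/2 region; the With Human Setting uses the inverse filter and keeps only that central region.]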
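The two settings can be sketched as a binary mask multiplied into each frame. This is a minimal NumPy sketch, assuming the masked center spans half of the width and half of the height with quarter margins on every side, as the 1/4, 1/2, 1/4 labels suggest. The function names are hypothetical and not taken from the authors' code.

```python
import numpy as np


def center_mask(height, width):
    """Binary filter f(x, y): 0 inside the central half of the frame, 1 elsewhere."""
    f = np.ones((height, width), dtype=np.uint8)
    top, left = height // 4, width // 4
    f[top:top + height // 2, left:left + width // 2] = 0
    return f


def without_human(image):
    """Center-blind image: I'(x, y) = I(x, y) * f(x, y)."""
    f = center_mask(image.shape[0], image.shape[1])
    if image.ndim == 3:
        f = f[:, :, None]  # broadcast the mask over color channels
    return image * f


def with_human(image):
    """Inverse setting: keep only the central region, blank the border."""
    f = 1 - center_mask(image.shape[0], image.shape[1])
    if image.ndim == 3:
        f = f[:, :, None]
    return image * f


# Toy frame: 8 x 8 RGB image filled with white pixels
frame = np.full((8, 8, 3), 255, dtype=np.uint8)
background_only = without_human(frame)
center_only = with_human(frame)
```

Because the two masks are complementary, the two outputs sum back to the original frame, which makes it easy to check that no pixel is counted in both settings.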
Framework
– Baseline: Very deep two-stream CNN [Wang+, arXiv15]
– Two different scenarios: without human and with human
Exploration experiment
• @UCF101
– UCF101 pre-trained model with very deep two-stream CNN
– With/Without Human Setting
Visual results (Full Image)
Visual results (Without Human Setting)
Without Human
• The concept of “Human Action Recognition without Human”
– The accuracies of the two settings are very close
• The with human setting is only +9.49% better than the without human setting
– Current motion representations rely heavily on the background
Future work
• This result is a telling reality
– We must accept it in order to build better motion representations
– A pure motion representation is urgently needed!
• More sophisticated approach
• Human only motion