【ECCV 2016 BNMW】Human Action Recognition without Human

Human Action Recognition without Human

He Yun1,2, Soma Shirakabe1,2, Yutaka Satoh1,2, Hirokatsu Kataoka1

1Computer Vision Research Group, AIST, Japan 2Human-Centered Vision Lab., University of Tsukuba, Japan

Motion representation

•  Databases: UCF101, HMDB51, ActivityNet

•  Approaches: IDT, Two-Stream CNN

–  Databases and approaches are already well established in the field

Action Database

http://www.thumos.info/

The problem setting in action recognition

•  Video-level prediction

–  1 action-label prediction per input video

[Figure: input video → Motion Descriptor → TennisSwing]
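Video-level prediction means the classifier emits a single label for the whole clip. One common way to get there, sketched here under the assumption that a model already produces per-frame class scores, is to average the scores over frames and take the arg-max. The function name `predict_video_label` and the toy scores are hypothetical and not taken from the slides.

```python
import numpy as np


def predict_video_label(frame_scores, class_names):
    """Return one action label for a whole video.

    frame_scores: array-like of shape (num_frames, num_classes), assumed to
    come from some per-frame classifier (hypothetical here).
    """
    scores = np.asarray(frame_scores, dtype=float)
    mean_scores = scores.mean(axis=0)          # pool over time
    return class_names[int(mean_scores.argmax())]


# Toy example: three frames, two UCF101 class names.
toy_scores = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
label = predict_video_label(toy_scores, ["Basketball", "TennisSwing"])
print(label)
```

Averaging is only one pooling choice; max pooling or a learned temporal model would fit the same problem setting.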

Dense Trajectories (DT) [Wang+, CVPR11]

•  Trajectory-based representation

–  A large number of trajectories

–  Feature description (HOG, HOF, MBH)

–  A codeword vector is generated
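The codeword step can be illustrated with hard-assignment bag-of-visual-words encoding: each trajectory descriptor votes for its nearest codeword, and the normalized vote histogram becomes the video's feature vector. This is a minimal sketch that assumes the codebook was learned beforehand (for example with k-means). The function name `encode_bag_of_words` and the toy data are hypothetical, and the slides do not state which encoding is used.

```python
import numpy as np


def encode_bag_of_words(descriptors, codebook):
    """Encode local descriptors as a normalized codeword histogram.

    descriptors: (n, d) array of local features (e.g. HOG/HOF/MBH vectors).
    codebook:    (k, d) array of codewords, assumed to be precomputed.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every descriptor to every codeword.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    nearest = dist2.argmin(axis=1)             # index of the closest codeword
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)         # L1-normalize, guard empty input


# Toy example: 4 two-dimensional descriptors, 2 codewords.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [0.5, 0.5]]
video_vector = encode_bag_of_words(toy_descriptors, toy_codebook)
```

In practice each descriptor type would usually get its own codebook, with the resulting histograms concatenated before classification.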

Two-Stream CNN [Simonyan+, NIPS14]

•  Spatial and temporal convolution

–  Spatial-stream: From an RGB image

–  Temporal-stream: From stacked optical flows

–  Score fusion: Average or SVM
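The averaging variant of score fusion can be sketched as follows: each stream's class scores are turned into probabilities with a softmax, and the two probability vectors are averaged. This is an illustrative sketch only. The function names, the optional `temporal_weight` parameter, and the toy logits are assumptions and do not come from the slides.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def fuse_by_average(spatial_logits, temporal_logits, temporal_weight=1.0):
    """Late fusion of the two streams by (weighted) averaging.

    temporal_weight is a hypothetical knob; 1.0 gives a plain average.
    Returns the fused probabilities and the predicted class index.
    """
    p_spatial = softmax(np.asarray(spatial_logits, dtype=float))
    p_temporal = softmax(np.asarray(temporal_logits, dtype=float))
    fused = (p_spatial + temporal_weight * p_temporal) / (1.0 + temporal_weight)
    return fused, int(np.argmax(fused))


# Toy example with three classes: the streams disagree, and the more
# confident temporal stream decides the fused prediction.
fused_probs, predicted = fuse_by_average([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```

The SVM alternative would instead train a classifier on the concatenated stream scores.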

Is background enough to classify actions?

•  RGB input is too strong!

–  The two-stream CNN paper [Simonyan+, NIPS14] reported that the spatial stream recognizes actions better than expected

•  72.4% with spatial-stream (RGB) @UCF101

•  “Human Action Recognition without Human”

Without Human?

•  Can human action recognition be done using only the motion of the background?

[Figure: original video → Motion Descriptor → TennisSwing; video without the human → Motion Descriptor → TennisSwing?]

Detailed setting of w/ and w/o Human

•  With and without human setting

–  Without human setting: center-masked ("center-blind") images from UCF101

–  With human setting: the inverse of the without human setting

[Figure: I(x,y) * f(x,y) = I’(x,y), where the mask f(x,y) divides each image axis into 1/4, 1/2, 1/4. Left: Without Human Setting. Right: With Human Setting (inverted mask).]
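The two settings can be sketched with a binary mask applied to every frame. The sketch below assumes, based on the 1/4, 1/2, 1/4 proportions in the figure, that the masked region is the central half of the frame along each axis, and that masked pixels are set to zero (black). The function names are hypothetical.

```python
import numpy as np


def background_mask(height, width):
    """Mask f(x, y): 0 in the central 1/2 x 1/2 region, 1 elsewhere.

    The 1/4 margins on each side are an assumption read off the figure.
    """
    f = np.ones((height, width), dtype=np.uint8)
    top, left = height // 4, width // 4
    f[top:top + height // 2, left:left + width // 2] = 0
    return f


def without_human(frame):
    """Keep only the surroundings: I'(x, y) = I(x, y) * f(x, y)."""
    f = background_mask(*frame.shape[:2])
    mask = f if frame.ndim == 2 else f[..., None]   # broadcast over channels
    return frame * mask


def with_human(frame):
    """Keep only the center: the inverse mask, 1 - f(x, y)."""
    f = 1 - background_mask(*frame.shape[:2])
    mask = f if frame.ndim == 2 else f[..., None]
    return frame * mask


# Toy example: an 8x8 all-white RGB frame.
frame = np.full((8, 8, 3), 255, dtype=np.uint8)
no_human = without_human(frame)   # central 4x4 block becomes black
only_human = with_human(frame)    # everything but the central block is black
```

A fixed center mask is a crude stand-in for person detection: it works only to the extent that the actor actually sits near the middle of the frame.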

Framework

–  Baseline: Very deep two-stream CNN [Wang+, arXiv15]

–  Two different scenarios: without human and with human

Exploration experiment

•  @UCF101

–  UCF101 pre-trained model with very deep two-stream CNN

–  With/Without Human Setting

Visual results (Full Image)

Visual results (Without Human Setting)

Without Human

•  The concept of “Human Action Recognition without Human”

–  The accuracies are very close

•  The with human setting is only 9.49% more accurate than the without human setting

–  Current motion representations rely heavily on the background

Future work

•  This reality is thought-provoking

–  We must accept this reality to achieve better motion representations

–  Pure motion representation is an urgent task!

•  More sophisticated approaches

•  Human-only motion