Spatio-Temporal Matching for Human Detection in Video
Feng Zhou, Fernando De la Torre

ECCV 2014




Good morning, everyone.

My name is Feng Zhou, and I am working with Dr. Fernando De la Torre.

Today, I will present our work on the spatial and temporal alignment of human behavior.

Human Detection in Video

Challenges: View-point Change, Non-rigid Deformation, Temporal Misalignment, Occlusion & Noise

Matching with 2D Videos, Matching with 3D Mocaps, Proposed Work

To introduce our method, let's consider the following example.

Suppose we are given a video of people kicking a ball. In order to recognize the action in this video, one way is to match it against other videos in a dataset.

Previous Work

Single-Frame DPM (Yang & Ramanan, 2011; Sapp et al., 2011; Ferrari et al., 2008): Sensitive to Noise
Multi-Frame DPM, i.e., DPM + Smoothing (Andriluka et al., 2012; Burgos et al., 2013): Expensive to Optimize
2D-3D Lifting (Agarwal & Triggs, 2006; Sigal & Black, 2006): Requires Large Data

Now let's look at the second problem. Here, we are trying to align multi-modal sequences in time. By multi-modal, we mean the sequences are captured by different sensors.

For instance, suppose we need to align three sequences of different subjects kicking a ball: the first sequence is a video, the second is a mocap sequence, and the last is captured by accelerometers on the joints.

The goal is to find the correspondence between frames.

Spatio-Temporal Matching

Input Video, Motion Capture Templates
Goal: a Many-to-One Mapping, i.e., finding the correspondence between trajectories

To address these issues, we propose our method as follows. First, we extract 2D feature trajectories from the video; to do that, we can use standard KLT trackers.

Workflow

Input Video → Dense Trajectory, Joint Response
Motion Capture Templates → Spatio-Temporal Shape Model
Matching → Many-to-One Mapping


Dense Trajectories
(Dense trajectories and motion boundary descriptors for action recognition, IJCV 2013)

2D Trajectory Coordinates: #Frames per Segment (e.g., 15) × #Trajectories (> 800)
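The speaker notes mention extracting 2D feature trajectories with standard KLT trackers. As a rough illustration (not the authors' code; the file name, 15-frame segment length, and corner-detector settings are assumptions), a per-segment trajectory matrix could be built with OpenCV like this:

```python
# Hypothetical sketch: short point trajectories via OpenCV's KLT tracker.
import cv2
import numpy as np

cap = cv2.VideoCapture("kick.avi")           # assumed input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Seed feature points over the first frame of the segment.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                              qualityLevel=0.01, minDistance=5)

tracks = [pts.reshape(-1, 2)]                # list of (P, 2) arrays
for _ in range(14):                          # 15 frames per segment
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade flow moves each point into the next frame.
    # (In practice, points with status == 0 should be dropped as lost.)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    tracks.append(nxt.reshape(-1, 2))
    prev_gray, pts = gray, nxt

# Stack into the 2D trajectory matrix: (frames, trajectories, xy).
W = np.stack(tracks)                         # shape (15, P, 2)
```

The paper's experiments use dense trajectories (Wang et al., IJCV 2013) rather than plain KLT; the sketch only shows the shape of the data the next stages consume.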


Joint Response
(Articulated pose estimation with flexible mixtures-of-parts, PAMI 2013; N-best maximal decoders for part models, ICCV 2011)

Trajectory-to-Joint Response over #Joints (= 14), e.g., Head, Right Foot
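A minimal sketch of how such a trajectory-to-joint response might be assembled: assuming the pose estimator yields a per-frame, per-joint score map, each trajectory accumulates the scores it passes through. The array layout and function name are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of a (trajectories x joints) response matrix.
import numpy as np

def joint_response(W, heat):
    """W: (T, P, 2) trajectory coords; heat: (T, J, H, W) joint score maps.
    Returns R: (P, J); higher = trajectory more likely on that joint."""
    T, P, _ = W.shape
    J = heat.shape[1]
    R = np.zeros((P, J))
    for t in range(T):
        # Round coordinates to pixels and clamp to the image bounds.
        x = np.clip(W[t, :, 0].astype(int), 0, heat.shape[3] - 1)
        y = np.clip(W[t, :, 1].astype(int), 0, heat.shape[2] - 1)
        for j in range(J):
            R[:, j] += heat[t, j, y, x]      # sample the map at each point
    return R / T                             # average over the segment
```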


Workflow (recap): next, the Spatio-Temporal Shape Model built from the Motion Capture Templates.


Spatio-Temporal Shape Model

Original Segments → After Alignment (Procrustes analysis) → Clusters 1-4
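For intuition, the rigid alignment underlying this step can be sketched with the standard orthogonal Procrustes (Kabsch) solution; treating each mocap segment as a flattened 3D point set is an assumption of this sketch, not necessarily the authors' exact preprocessing:

```python
# Minimal sketch: rigidly align one segment's point set Y onto a reference X.
import numpy as np

def procrustes_align(X, Y):
    """X, Y: (N, 3) point sets. Returns Y rotated and translated onto X."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # Optimal rotation from the SVD of the cross-covariance (Kabsch).
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt
    if np.linalg.det(R) < 0:                 # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    return Yc @ R.T + X.mean(0)
```

After all segments are aligned this way, clustering (as on the slide) groups segments with similar motion.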


Spatio-Temporal Shape Model

Cluster 3: 3D Trajectory Coordinates ≈ Trajectory Bases × Weights × Shape Bases (a linear operator)
(Bilinear spatiotemporal basis models, ACM Trans. Graphics, 2012)
With DCT trajectory bases, the weights have a closed-form solution.
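A small numpy sketch of the bilinear spatiotemporal basis idea referenced here (Akhter et al., ACM TOG 2012): with an orthonormal DCT trajectory basis and orthonormal shape bases, the weight matrix does have a closed form. The dimensions and the use of plain SVD shape bases are assumptions of this sketch:

```python
# Hypothetical sketch: S (frames x 3*joints) ~= Theta @ C @ B.T.
import numpy as np
from scipy.fft import dct

def fit_bilinear(S, k_t=5, k_s=10):
    """S: (F, 3J) stacked 3D coordinates for one cluster of segments."""
    F = S.shape[0]
    # Orthonormal DCT-II matrix; its first k_t rows span smooth trajectories.
    D = dct(np.eye(F), axis=0, norm='ortho')
    Theta = D.T[:, :k_t]                     # (F, k_t) trajectory bases
    # Shape bases from the SVD of the frame-by-frame shapes.
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    B = Vt[:k_s].T                           # (3J, k_s) shape bases
    # Both bases are orthonormal, so the least-squares weights are closed-form.
    C = Theta.T @ S @ B                      # (k_t, k_s) weights
    return Theta, C, B                       # reconstruction: Theta @ C @ B.T
```

The point of the DCT choice is exactly this closed form: no iterative fitting is needed, and truncating to k_t low frequencies enforces temporal smoothness for free.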


Spatio-Temporal Shape Model

Cluster 3: DCT Bases × Weights × Shape Bases → Reconstruction


Workflow (recap): next, Matching the model to the video to obtain the Many-to-One Mapping.


Matching

[Slide shows the annotated matching objective: the model, DCT Bases × Weights × Shape Bases, is mapped through an orthographic 3D-2D Projection and compared with the 2D Trajectory Coordinates under a many-to-one Correspondence Matrix, weighted by the Joint Response.]

Matching

[Slide adds a Regularization term to the same objective; the optimization alternates Linear Programming (for the correspondence) with L1 Procrustes Analysis (for the projection).]
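To make the Linear Programming step concrete, here is a hedged sketch of the correspondence subproblem: with the geometry held fixed, choosing a many-to-one trajectory-to-joint mapping under a precomputed cost matrix relaxes to an LP. This shows the general pattern, not the paper's exact formulation:

```python
# Hypothetical sketch: LP relaxation of the many-to-one assignment.
import numpy as np
from scipy.optimize import linprog

def match_lp(cost):
    """cost: (P, J) assignment costs (fit error minus joint response, say).
    Returns a hard label in {0..J-1} for each of the P trajectories."""
    P, J = cost.shape
    # One equality constraint per trajectory: its weights sum to 1,
    # while any number of trajectories may share a joint (many-to-one).
    A_eq = np.zeros((P, P * J))
    for p in range(P):
        A_eq[p, p * J:(p + 1) * J] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.ones(P),
                  bounds=(0.0, 1.0), method="highs")
    Z = res.x.reshape(P, J)
    # Each constraint touches a disjoint block of variables, so the LP
    # optimum is integral; argmax recovers the hard mapping.
    return Z.argmax(axis=1)
```

In the full objective, the assignments are coupled to the unknown projection, which is why the method alternates this step with an L1 Procrustes update rather than solving one LP.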

Results: CMU Motion Capture Dataset

5 Actions (Walk, Run, Jump, Kick, Swing), 8 Sequences per Action, 0~200 Outliers, 4 Cameras, 14 Joints
Example: 20 outliers, Camera 1 and Camera 2


Results: CMU Motion Capture Dataset

Input vs. Greedy vs. STM (Proposed): Kick, Walk


Results: CMU Motion Capture Dataset

Errors (All Actions) vs. #Outliers: Generic Model and Action-Specific Model


Results: Berkeley MHAD Dataset

12 Persons, 2 Cameras, 11 Actions, 14 Joints
DPM (Yang & Ramanan) vs. STM (Proposed): Jump


Results: Berkeley MHAD Dataset

12 Persons, 2 Cameras, 11 Actions, 14 Joints
DPM (Yang & Ramanan) vs. STM (Proposed): SitDown


Results: Berkeley MHAD Dataset

12 Persons, 2 Cameras, 11 Actions, 14 Joints
DPM (Yang & Ramanan) vs. STM (Proposed): Throw


Results: Berkeley MHAD Dataset

Generic Model vs. Action-Specific Model


Results: Human 3.6M Dataset

11 Persons, 2 Cameras, 17 Actions, 14 Joints
Walk


Results: Human 3.6M Dataset

Generic Model vs. Action-Specific Model


Conclusion

Spatio-Temporal Matching: temporal smoothing by trajectories, view-invariant matching, and an efficient linear programming solution.
Future Work: temporal consistency between segments, and improving the optimization.

In this talk, I will present two methods.

In the first problem, we try to match feature points between images. This is called spatial matching, because we need to find the correspondence in space.

In the second, given several video and mocap sequences, we try to temporally align the sequences.

Let's first look at the problem of spatial matching.

Backup Slides

Linear Programming Approximation
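The slide's derivation is not recoverable from this rip, but the textbook construction behind such an approximation replaces each absolute value in an L1 objective with an auxiliary variable, which yields a linear program:

```latex
\min_{x} \|Ax - b\|_1
\;\Longleftrightarrow\;
\min_{x,\,t}\ \mathbf{1}^\top t
\quad \text{s.t.} \quad -t \le Ax - b \le t,
```

where the minimization drives each slack variable t_i down to |(Ax - b)_i|, so the two problems share the same optimum in x.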

Previous Work on 3D-2D Matching

Point Set Matching (ICP & RANSAC: Gold et al., 1998; David et al., 2004; Branch & Bound: Li & Hartley, 2007): Expensive to Optimize
View-invariant Representation (Epipolar Geometry: Rao et al., 2002; Self-similarity: Junejo et al., 2011): Sensitive to Noise
3D-2D Rendering (Motion History Volume: Weinland et al., 2007; Model Recommendation: Matikainen et al., 2012): Requires Large Data
