1

Articulated Pose Estimation in a Learned Smooth Space of Feasible Solutions

Taipeng Tian, Rui Li and Stan Sclaroff

Computer Science Dept.

Boston University

2

Introduction

• Motivating application
– Gesture recognition
– Fixed gesture lexicon
– For example:

Aircraft signaler hand gestures

Traffic controller hand signals

Basketball referee hand signals

3

Problem Definition

Input (Observation): Silhouette (Alt Moments)

Pose Estimation

Output: 2D Projected Marker Positions

4

Related Work : Pose Estimation from a Single Image

• Geometry Based
– Taylor CVIU ’01
– Barron & Kakadiaris IVC ’01
– Parameswaran & Chellappa CVPR ’04

• Learning Based
– Rosales & Sclaroff HUMO ’00
– Agarwal & Triggs CVPR ’04

• Others
– Lee & Cohen CVPR ’04
– Shakhnarovich, Viola, Darrell ICCV ’03
– Mori, Ren, Efros and Malik CVPR ’04
– Many more …

5

Idea 1 : Learning Mappings

• Specialized Mapping Architecture (SMA) [Rosales and Sclaroff NIPS ’01]

• Relevance Vector Regression [Agarwal and Triggs CVPR ’04]

Image Features → Pose

7

Idea 2 : Exploring the Solution Space

• Simulated Annealing [Deutscher et al. CVPR ’00]

• Markov Chain Monte Carlo (MCMC) [Lee and Cohen CVPR ’04]

• etc …

8

Idea 2 : Exploring the Solution Space

• Simulated Annealing [Deutscher et al. CVPR ’00]

• Markov Chain Monte Carlo (MCMC) [Lee and Cohen CVPR ’04]

• etc …

• Use an accurate body model, typically with many degrees of freedom (DOF).

• Explore the pose space for a solution consistent with the observations.

• Difficult for high DOF.

• Computationally intensive.

9

Key Observations

• We have a constrained set of poses.
• Not necessary to explore the full parameter space.
• Combine two ideas
– Learn mappings
– Explore a constrained space (i.e. learned model of body poses)

Aircraft signaler hand gestures

Traffic controller hand signals

Basketball referee hand signals

10

Overview of Framework

Learning Phase (Y: Training Data, X: Latent Space)
• Learn the rendering function Φ(·)
• Learn a model of human body poses

Pose Inference Phase

Input Silhouette → $\min_{x,y} \|\Phi(y) - s\|^2 + \lambda_1 L_Y(x, y)$ → Output Pose

11

Learning a Model of Human Poses

• Gaussian Process Latent Variable Model (GPLVM) [Neil Lawrence NIPS ’04] is used.

• GPLVM originally used for visualizing high dimensional data

• Grochow et al. (SIGGRAPH ’04) use it to solve the inverse kinematics problem for human motion animation.

• Here we use it for automated articulated body pose inference.

12

Gaussian Process Latent Variable Model (GPLVM) Overview

[Figure: probabilistic mapping between the lower-dimensional latent space (x) and the higher-dimensional space (y)]

13

GPLVM Training : Learning a Model of Body Poses

• Given: training set of 2D projected marker positions $\{y_i\}$ (each $y_i$ is of dimension D)

• Goal: learn parameters $\{x_i\}, \alpha, \beta, \gamma$
– $\{x_i\}$: corresponding latent variable values for each training data point
– $\alpha, \beta, \gamma$: variables related to the kernel

14

Kernel Function

• Also known as covariance function.

• Measures the similarity of the latent variables x and x’.

$$k(x, x') = \alpha \exp\left(-\frac{\gamma}{2}\|x - x'\|^2\right) + \delta_{x,x'}\,\beta^{-1}$$

• For a data set of size N, we form an N by N kernel matrix K, in which $K_{i,j} = k(x_i, x_j)$.
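As an illustration of the kernel slide, here is a minimal sketch in plain Python. It assumes the RBF form usually used with GPLVM, k(x, x') = α exp(−γ/2 ||x − x'||²) + δ(x, x') / β; the function names `rbf_kernel` and `kernel_matrix` and the sample points are hypothetical, chosen only for this example.

```python
import math

def rbf_kernel(x, x2, alpha, beta, gamma):
    """k(x, x') = alpha * exp(-gamma/2 * ||x - x'||^2) + delta(x, x') / beta."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x2))
    # Delta term: the noise 1/beta is added only when the two points coincide.
    noise = 1.0 / beta if list(x) == list(x2) else 0.0
    return alpha * math.exp(-0.5 * gamma * sq_dist) + noise

def kernel_matrix(X, alpha, beta, gamma):
    """N-by-N matrix K with K[i][j] = k(x_i, x_j)."""
    return [[rbf_kernel(xi, xj, alpha, beta, gamma) for xj in X] for xi in X]

# Three hypothetical 2D latent points.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(X, alpha=1.0, beta=100.0, gamma=1.0)
```

The resulting matrix is symmetric, and entries shrink as latent points move apart, which is what lets K act as a similarity measure.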

15

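GPLVM Training : Learning a Model of Body Poses

• For a single dimension d, the likelihood of y given the Gaussian Process (GP) model parameters is:

$$p(\{y_{i,d}\} \mid \{x_i\}, \alpha, \beta, \gamma) = \frac{1}{\sqrt{(2\pi)^N |K|}} \exp\left(-\frac{1}{2} Y_d^T K^{-1} Y_d\right)$$

• Joint likelihood for D dimensions is:

$$p(\{y_i\} \mid \{x_i\}, \alpha, \beta, \gamma) = \prod_{d=1}^{D} p(\{y_{i,d}\} \mid \{x_i\}, \alpha, \beta, \gamma)$$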

16

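To learn GPLVM from the training set $\{y_i\}$, we maximize the following posterior:

$$p(\{x_i\}, \alpha, \beta, \gamma \mid \{y_i\})$$

And placing the priors

$$p(x) = N(x \mid 0, I), \qquad p(\alpha, \beta, \gamma) \propto \frac{1}{\alpha\beta\gamma}$$

Negative log:

$$\frac{D}{2}\ln|K| + \frac{1}{2}\sum_{d} Y_d^T K^{-1} Y_d + \frac{1}{2}\sum_{i}\|x_i\|^2 + \ln(\alpha\beta\gamma)$$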
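The training objective can be evaluated numerically as sketched below. This is only an illustration under stated assumptions: the negative log posterior is taken to be D/2 ln|K| + 1/2 Σ_d Y_dᵀ K⁻¹ Y_d + 1/2 Σ_i ||x_i||² + ln(αβγ), the kernel is the RBF form with a 1/β noise term on the diagonal, and all function names are hypothetical. A small Cholesky factorization in plain Python keeps the sketch dependency-free; a real implementation would normally use a linear algebra library and a gradient-based optimizer over {x_i}, α, β, γ.

```python
import math

def kernel_matrix(X, alpha, beta, gamma):
    """K[i][j] = alpha * exp(-gamma/2 * ||x_i - x_j||^2) + (i == j) / beta."""
    def k(i, j):
        sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
        return alpha * math.exp(-0.5 * gamma * sq) + (1.0 / beta if i == j else 0.0)
    n = len(X)
    return [[k(i, j) for j in range(n)] for i in range(n)]

def cholesky(K):
    """Lower-triangular L with L L^T = K (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[m] * z[m] for m in range(i))) / row[i])
    return z

def neg_log_posterior(X, Y, alpha, beta, gamma):
    """Negative log posterior of latent points X given observations Y."""
    D = len(Y[0])
    L = cholesky(kernel_matrix(X, alpha, beta, gamma))
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(len(L)))  # ln|K|
    quad = 0.0
    for d in range(D):
        # Y_d^T K^-1 Y_d equals ||L^-1 Y_d||^2.
        z = forward_solve(L, [y[d] for y in Y])
        quad += sum(v * v for v in z)
    prior_x = 0.5 * sum(v * v for x in X for v in x)
    return 0.5 * D * log_det + 0.5 * quad + prior_x + math.log(alpha * beta * gamma)

# Hypothetical toy data: two 1D latent points, two 2D observations.
value = neg_log_posterior([(0.0,), (1.0,)], [(0.0, 1.0), (1.0, 0.0)], 1.0, 10.0, 1.0)
```

Every evaluation factorizes the full N by N kernel matrix, which is the cost that motivates computing it on a subset of the data.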

17

To learn GPLVM from the training set $\{y_i\}$, we maximize the following posterior:

$$p(\{x_i\}, \alpha, \beta, \gamma \mid \{y_i\})$$

Negative log:

$$\frac{D}{2}\ln|K| + \frac{1}{2}\sum_{d} Y_d^T K^{-1} Y_d + \frac{1}{2}\sum_{i}\|x_i\|^2 + \ln(\alpha\beta\gamma)$$

Computationally intensive. A subset is chosen to compute the kernel matrix. This subset of poses is called the Active Set.

18

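• For a new pair (x, y) we can predict using

$$p(\{x_i\}, x', \alpha, \beta, \gamma \mid \{y_i\}, y')$$

$$L_Y(x, y) = -\ln p(\{x, y\} \mid \{x_i\}, \{y_i\}, \alpha, \beta, \gamma) = \frac{\|y - f(x)\|^2}{2\sigma^2(x)} + \frac{D}{2}\ln\sigma^2(x) + \frac{1}{2}\|x\|^2$$

• This equation can be used to solve for x given y or y given x, via gradient descent.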
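A minimal sketch of how L_Y(x, y) could be evaluated, assuming f(x) and σ²(x) are the standard Gaussian-process predictive mean Yᵀ K⁻¹ k(x) and variance k(x, x) − k(x)ᵀ K⁻¹ k(x), with the RBF kernel and a 1/β noise term. The helper names (`k_fn`, `solve`, `predict`, `L_Y`) and the toy data are hypothetical, and plain Gaussian elimination stands in for a proper linear algebra routine.

```python
import math

def k_fn(x, x2, alpha, gamma):
    """Noise-free RBF part: alpha * exp(-gamma/2 * ||x - x'||^2)."""
    return alpha * math.exp(-0.5 * gamma * sum((a - b) ** 2 for a, b in zip(x, x2)))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def predict(x, X, Y, alpha, beta, gamma):
    """Return the predictive mean f(x) and variance sigma^2(x)."""
    n = len(X)
    K = [[k_fn(X[i], X[j], alpha, gamma) + (1.0 / beta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    kx = [k_fn(x, xi, alpha, gamma) for xi in X]   # k(x): kernel against training points
    w = solve(K, kx)                               # w = K^-1 k(x)
    mean = [sum(w[i] * Y[i][d] for i in range(n)) for d in range(len(Y[0]))]
    # k(x, x) = alpha + 1/beta, since the noise term applies to a point with itself.
    var = alpha + 1.0 / beta - sum(kx[i] * w[i] for i in range(n))
    return mean, var

def L_Y(x, y, X, Y, alpha, beta, gamma):
    """||y - f(x)||^2 / (2 sigma^2(x)) + D/2 ln sigma^2(x) + 1/2 ||x||^2."""
    mean, var = predict(x, X, Y, alpha, beta, gamma)
    resid = sum((a - b) ** 2 for a, b in zip(y, mean))
    return resid / (2.0 * var) + 0.5 * len(y) * math.log(var) + 0.5 * sum(v * v for v in x)

# Hypothetical toy model: three 2D latent points with 3D "poses".
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
Y = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
good = L_Y(X[1], Y[1], X, Y, 1.0, 1e4, 1.0)   # pose matches its latent point
bad = L_Y(X[1], Y[2], X, Y, 1.0, 1e4, 1.0)    # pose belongs to another latent point
```

In this toy setup a pose paired with its own latent point scores a much lower L_Y than a mismatched pair, and the variance grows away from the training points, so minimizing L_Y pulls (x, y) toward the learned poses.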

19

GPLVM

[Figure: learned 2D latent space, axes $x_1$ and $x_2$]

21

GPLVM

Left-hand-raised silhouettes tend to be clustered together

22

GPLVM

Does not always do a good job

23

About GPLVM

• Allows mapping to and from the lower dimensional space.

• Allows smooth parameterization (i.e. allows derivatives) in lower dimensional space.

• Two dimensions work well for our data set. (Grochow et al. use 2-5.)

24

Learning the Forward/Rendering Function

Input: 2D Pose

Output: Silhouettes (represented using Alt Moments)

Similar to Rosales and Sclaroff

25

Overview of Framework

Learning Phase (Y: Training Data, X: Latent Space)
• Learn the rendering function Φ(·)
• Learn a model of human body poses

Pose Inference Phase

Input Silhouette → $\min_{x,y} \|\Phi(y) - s\|^2 + \lambda_1 L_Y(x, y)$ → Output Pose

26

Pose Inference

$$\min_{y} \|\Phi(y) - s\|^2 + \lambda_1 \|y\|^2$$

Typical regularization (also used by Agarwal and Triggs)

27

Data Term

$$\min_{y} \|\Phi(y) - s\|^2 + \lambda_1 \|y\|^2$$

Forward function (rendering function) Φ: 2D Projected Marker Positions → Silhouette (Alt Moments)

28

Regularization Term

$$\min_{x,y} \|\Phi(y) - s\|^2 + \lambda_1 \|y\|^2$$

Replace $\lambda_1 \|y\|^2$ with a prior knowledge term (i.e. the learned model of poses): $\lambda_1 L_Y(x, y)$

Independent of feature s

29

Pose Inference

$$\min_{x,y} \|\Phi(y) - s\|^2 + \lambda_1 L_Y(x, y)$$

Solution obtained using Conjugate Gradient
– Initialization using Active Set
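The inference step can be sketched as follows, under loud assumptions: the objective is taken to be ||Φ(y) − s||² + λ₁ L_Y(x, y); the starting point is the active-set pair with the lowest objective (one plausible reading of "initialization using Active Set"); and plain gradient descent with finite-difference gradients stands in for conjugate gradient to keep the example short. `phi`, `L_Y`, and the active set below are hypothetical toy stand-ins, not the learned rendering function or GPLVM.

```python
def objective(x, y, s, phi, L_Y, lam1):
    """||phi(y) - s||^2 + lam1 * L_Y(x, y)."""
    data = sum((a - b) ** 2 for a, b in zip(phi(y), s))
    return data + lam1 * L_Y(x, y)

def best_active_start(active_set, s, phi, L_Y, lam1):
    """Pick the active-set pair (x_i, y_i) with the lowest objective."""
    return min(active_set, key=lambda xy: objective(xy[0], xy[1], s, phi, L_Y, lam1))

def numeric_grad(fn, v, eps=1e-6):
    """Central finite-difference gradient of fn at the flat list v."""
    grad = []
    for i in range(len(v)):
        hi = v[:i] + [v[i] + eps] + v[i + 1:]
        lo = v[:i] + [v[i] - eps] + v[i + 1:]
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad

def infer_pose(s, active_set, phi, L_Y, lam1, dim_x, step=0.05, iters=500):
    """Minimize the objective jointly over x and y, starting from the active set."""
    x0, y0 = best_active_start(active_set, s, phi, L_Y, lam1)
    v = list(x0) + list(y0)
    fn = lambda u: objective(u[:dim_x], u[dim_x:], s, phi, L_Y, lam1)
    for _ in range(iters):
        g = numeric_grad(fn, v)
        v = [vi - step * gi for vi, gi in zip(v, g)]
    return v[:dim_x], v[dim_x:]

# Toy stand-ins: a linear "rendering" function and a quadratic pose prior.
phi = lambda y: [y[0] + y[1], y[0] - y[1]]
L_Y = lambda x, y: (y[0] - x[0]) ** 2 + (y[1] - 2 * x[0]) ** 2 + 0.5 * x[0] ** 2
active_set = [([0.0], [0.0, 0.0]), ([0.9], [0.9, 1.8]), ([-1.0], [-1.0, -2.0])]
s = phi([1.0, 2.0])   # features of the pose we hope to recover
x_hat, y_hat = infer_pose(s, active_set, phi, L_Y, lam1=0.1, dim_x=1)
```

Starting from an active-set point keeps the search inside the region covered by the learned poses, which matters because the real objective is not convex.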

30

Data Collection

• 12 gestures in the flight director lexicon

• Synthesize 6000 (silhouette, pose) pairs using Poser

• 3000 training (Male model)

• 3000 testing (Female model)

3D Pose

Synthesized silhouettes sampled uniformly over the frontal view-sphere

31

Experiments (Synthetic Data)

(a) Silhouette images generated by Poser 5 (Test Set)

(b) Estimation from SMA (Specialized Mapping Architecture)

(c) Our Approach

(d) Ground Truth

32

Comparison with SMA

33

Additional Constraints

Additional constraints can be added to achieve a more accurate estimate, e.g. temporal consistency:

$$\min_{x_t, y_t} \|\Phi(y_t) - s\|^2 + \lambda_1 L_Y(x_t, y_t) + \lambda_2 \|y_t - y_{t-1}\|^2$$
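A small sketch of the temporal term, assuming it enters the objective as an extra penalty λ₂ ||y_t − y_{t−1}||² on top of the data term and the pose prior. The function name `temporal_objective` and the toy `phi` and `L_Y` are hypothetical and only make the example runnable.

```python
def temporal_objective(x_t, y_t, y_prev, s, phi, L_Y, lam1, lam2):
    """||phi(y_t) - s||^2 + lam1 * L_Y(x_t, y_t) + lam2 * ||y_t - y_prev||^2."""
    data = sum((a - b) ** 2 for a, b in zip(phi(y_t), s))
    smooth = sum((a - b) ** 2 for a, b in zip(y_t, y_prev))
    return data + lam1 * L_Y(x_t, y_t) + lam2 * smooth

# Toy stand-ins for the learned rendering function and pose prior.
phi = lambda y: [y[0] + y[1], y[0] - y[1]]
L_Y = lambda x, y: (y[0] - x[0]) ** 2 + (y[1] - 2 * x[0]) ** 2 + 0.5 * x[0] ** 2

y_prev = [1.0, 2.0]          # pose estimated for the previous frame
s = phi([1.0, 2.0])          # features observed in the current frame
stay = temporal_objective([1.0], [1.0, 2.0], y_prev, s, phi, L_Y, lam1=0.1, lam2=1.0)
jump = temporal_objective([1.0], [2.0, 3.0], y_prev, s, phi, L_Y, lam1=0.1, lam2=1.0)
```

With λ₂ = 0 the objective reduces to the single-frame one, so the weight controls how strongly consecutive poses are tied together.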

34

Experiments (Real Data)

(a) Silhouette images of real person

(b) SMA (Specialized Mapping Architecture)

(c) Our Approach (Without Temporal Consistency)

(d) Our Approach (With Temporal Consistency)

35

Experiments (Real Data)

(a) Silhouette images of real person

(b) SMA (Specialized Mapping Architecture)

(c) Our Approach (Without Temporal Consistency)

(d) Our Approach (With Temporal Consistency)

36

Conclusion

• Proposed a novel method for pose estimation for a pre-defined gesture lexicon.

• Interesting to note that two dimensions are enough in our case.

• Technique is fast (about 0.1 sec per frame in Matlab).

• Tracking as an extension. [video]

37

Thank You

38

Comments after the talk

• Related Works
– Bullets / summary of strengths vs. weaknesses
– Why do we need this work?
• Include year of publication for the related work (e.g. Rosales and Sclaroff work not mentioned, Sminchisescu work not mentioned)
• Order the related work temporally?
• Include an introduction slide and a motivating slide
– How to motivate this work?
– State of the art is so and so… We found this common weakness. So we proposed this work.
• Human pose not mentioned in the intro
• At the end of the talk, say why to use this work over the others
• Why GPLVM and not other reduction techniques, like LLE/PCA/ISOMAP etc.?
• Give a top-level overview of the algorithm. A flow chart view?
• Explain the L(x,y) mapping using an illustration like the mapping between two planes. Clearly say what is the high-dimensional y and what is the low-dimensional x
• Give a reference for GPLVM or a website link.
• Add a slide on the math of GPLVM
• The Tikhonov regularization approach of minimizing ||Φ(y) − s|| + regularization term. Usually the regularization term is ||Dx|| but now we chose L(x,y). Explain why
• Slide to talk about the temporal constraint.
• Why learn the rendering function? I.e. because we want to take the derivative…
• Give the numbers for the training set; this gives an idea of how good the quantitative results are

39

Related Work

Model Based
• Simulated Annealing [Deutscher et al. CVPR ’00]
• Kinematic Jump Processes [Sminchisescu and Triggs CVPR ’03]
• Markov Chain Monte Carlo (MCMC) [Lee and Cohen CVPR ’04]
• etc …

Learning Based
• Specialized Mapping Architecture (SMA) [Rosales and Sclaroff NIPS ’01]
• Relevance Vector Regression [Agarwal and Triggs CVPR ’04]
• Parameter Sensitive Hashing [Shakhnarovich et al. ICCV ’03]
• etc …

40

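To learn GPLVM from the training set $\{y_i\}$, we maximize the following posterior:

$$p(\{x_i\}, \alpha, \beta, \gamma \mid \{y_i\})$$

Negative log:

$$L = \frac{D}{2}\ln|K| + \frac{1}{2}\sum_{d} Y_d^T K^{-1} Y_d + \frac{1}{2}\sum_{i}\|x_i\|^2 + \ln(\alpha\beta\gamma)$$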

41

Overview of Framework (Learning Phase)

• Learn the rendering function Φ(·)

• Learn a model of human body poses (using GPLVM)

42

Overview of Framework (Estimation Phase)

Input Silhouette

Output Pose

Search over learned model of human body pose for solution consistent with observation

43

Kernel Function

• Measures the similarity of the latent variables x and x’.

$$k(x, x') = \alpha \exp\left(-\frac{\gamma}{2}\|x - x'\|^2\right) + \delta_{x,x'}\,\beta^{-1}$$

– $\alpha$: how correlated x, x’ are in general
– $\gamma$: spread of the function
– $\beta^{-1}$: noise in the prediction

• For a data set of size N, we can form an N by N kernel matrix K, in which $K_{i,j} = k(x_i, x_j)$.

44

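GPLVM Training : Learning a Model of Body Poses

To learn the parameters of the GPLVM from the training set $\{y_i\}$, we maximize the following posterior:

$$p(\{x_i\}, \alpha, \beta, \gamma \mid \{y_i\})$$

And placing the priors

$$p(x) = N(x \mid 0, I), \qquad p(\alpha, \beta, \gamma) \propto \frac{1}{\alpha\beta\gamma}$$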

45

Gaussian Process Latent Variable Model (GPLVM)

$L_Y(x, y)$: expresses how well the two values match
– x: low-dimensional parameterization
– y: original space representation

Space of Feasible Poses

46

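• For a new pair (x, y) we can predict using

$$p(\{x_i\}, x', \alpha, \beta, \gamma \mid \{y_i\}, y')$$

$$L_Y(x, y) = -\ln p(\{x, y\} \mid \{x_i\}, \{y_i\}, \alpha, \beta, \gamma) = \frac{\|y - f(x)\|^2}{2\sigma^2(x)} + \frac{D}{2}\ln\sigma^2(x) + \frac{1}{2}\|x\|^2$$

$$f(x) = Y^T K^{-1} k(x), \qquad \sigma^2(x) = k(x, x) - k(x)^T K^{-1} k(x)$$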