
  • Looking at / Sensing people

    Ioannis Patras

    www.eecs.qmul.ac.uk/~ioannisp

    Centre for Intelligent Sensing

    Queen Mary University of London

  • Related research

    • Scene analysis

    Object recognition / Semantic segmentation

    • Motion Analysis

    Motion estimation / segmentation

    Object Tracking

    • Facial (Expression) Analysis

    Head tracking/Facial Feature Tracking

    Facial expression recognition

    • Action / Gesture Recognition

    Spatio-temporal representations for action recognition

    Pose estimation

    • Brain Computer Interfaces

    Dynamic Vision

    Looking at / sensing people

    Static Analysis

  • Looking at/sensing people

    • Facial (Expression) Analysis

    Head tracking/Facial Feature Tracking

    Facial expression recognition

    • Action / Gesture Recognition

    Action recognition and localisation

    Pose estimation

    • Brain Computer Interfaces

  • Introduction

    Motivation

    Vision-based analysis and understanding of human activities is becoming of paramount importance in a world centered on humans and overwhelmed with visual data.

    Challenges

    Detection, tracking, understanding

    Applications

    Visual Surveillance, Human Machine/Robot Interaction, Intelligent Systems, Multimedia Analysis, Ambient Intelligence

    Related expertise

    Computer Vision, Pattern Recognition, AI

  • Recognition and Localisation of Actions

    Goal:

    Recognize categories of actions

    Localize them in terms of their bounding box (space + time)

    Challenges:

    Occlusions, clutter, variations

    Hypothesis: Analysis can be restricted to a set of spatiotemporally ‘interesting’/salient events (sketched below)
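    The salient-event hypothesis can be illustrated with a minimal sketch. The response function below (temporal-gradient energy) and the fixed event budget are assumptions for illustration, not the detector used in this work.

```python
import numpy as np

def salient_events(video, num_events=200):
    """Keep the (t, y, x) locations with the largest temporal-gradient energy.
    `video` is a (T, H, W) float array of grey-level frames."""
    energy = np.abs(np.diff(video, axis=0))                 # crude measure of local change
    top = np.argsort(energy, axis=None)[::-1][:num_events]  # strongest responses first
    return list(zip(*np.unravel_index(top, energy.shape)))  # salient (t, y, x) events

# usage: events = salient_events(np.random.rand(50, 120, 160))
```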

  • Implicit Shape Model (ISM)

    Input training patches → clustering → codewords

    Each codeword is associated with a vote map (a set of offsets from the codeword centre) that gives the possible locations of the hypothesis centre (see the voting sketch after this slide)

    [Figure: training patches D1, D2, D3 clustered into codewords in appearance space, each storing its offsets to the hypothesis centre]
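    A minimal sketch of the codeword voting described above, assuming nearest-centroid codeword assignment and unweighted, normalised votes (the slides use a probabilistic Hough score; the names and choices here are illustrative).

```python
import numpy as np

def hough_vote(features, positions, codebook, offsets, out_shape):
    """ISM-style voting: each local feature is matched to its nearest codeword,
    and the codeword's stored offsets cast votes for the hypothesis centre.
    codebook: (K, D) cluster centres; offsets[k]: list of (dy, dx) training offsets."""
    hough = np.zeros(out_shape)
    for f, (y, x) in zip(features, positions):
        k = int(np.argmin(np.linalg.norm(codebook - f, axis=1)))   # nearest codeword
        for dy, dx in offsets[k]:                                   # its vote map
            cy, cx = int(y + dy), int(x + dx)
            if 0 <= cy < out_shape[0] and 0 <= cx < out_shape[1]:
                hough[cy, cx] += 1.0 / len(offsets[k])              # normalised vote
    return hough                                                    # peaks = hypothesis centres
```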

  • Implicit Shape Model (ISM)

    [Figure: codewords S1 … SN in appearance space; a local feature at xi is matched to a codeword and the codeword’s voting map is cast into the output Hough space]

  • Implicit Shape Model (ISM)

    [Figure: votes from several local features accumulated in the voting space of the output Hough space]

  • Implicit Shape Model (ISM)

    [Figure: the hypothesis centre emerges as a peak in the output Hough space]

  • Discriminative learning

    • Higher weights for pdfs with higher localisation accuracy

    • Class dictionary comprises discriminative codewords

    • Adaboost on the codeword similarities, with codeword weights of the form

    w_d ∝ exp( | log p(c | d) − log p(c̄ | d) | )

  • Discriminative Voting Score

    Yc : an area around the hypothesis centre of the training image

    Let S(y) denote the probabilistic Hough score at location y

    The discriminative voting score (sketched below):

    Σ_{y ∈ Yc} S(y) − Σ_{y ∉ Yc} S(y)

    Objective: maximize the discriminative voting score over the training set

    [Figure: output Hough space with the region Yc around the true centre and the local features voting into it]
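    A short sketch of this score, assuming Yc is a circular region of a given radius around the ground-truth centre (the shape of Yc is an assumption made only for this example).

```python
import numpy as np

def discriminative_score(hough, centre, radius):
    """Sum of Hough scores S(y) inside the region Yc around the true centre
    minus the sum outside it, i.e. the quantity maximised over the training set."""
    yy, xx = np.mgrid[0:hough.shape[0], 0:hough.shape[1]]
    inside = (yy - centre[0]) ** 2 + (xx - centre[1]) ** 2 <= radius ** 2
    return hough[inside].sum() - hough[~inside].sum()
```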

  • Goal: Learn a task-dependent dictionary for the localization of actions

    ISM with local features { xi, li }, i = 1, …, N

    S(y): Hough score at location y; yc: hypothesis centre; Yc: area around yc

    Objective: Σ_{yi ∈ Yc} S(yi) − Σ_{yi ∉ Yc} S(yi)

  • Action recognition

    • KTH dataset – average: 88%

    • HoHA dataset – average: 37%

  • Artificial occlusions and clutter

  • Detection Results


  • Regression Forests for Facial Analysis

    [H. Yang, I. Patras, ACCV 2012]

    [H. Yang, I. Patras, IEEE FG 2013]

    [Figure: a regression tree routing the input training data and an input test point through the split function at each node (a node-test sketch follows below)]
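    The slides do not specify the node test; a common choice in such face forests (e.g. two-pixel comparison tests as in Dantone et al. CVPR12) is sketched below as an assumption.

```python
import numpy as np

def split_function(patch, p1, p2, tau):
    """Binary node test: route a 2-D image patch left/right by comparing the
    intensities at two positions p1, p2 against a threshold tau."""
    return patch[p1] - patch[p2] < tau          # True -> left child, False -> right child

# usage: split_function(np.random.rand(16, 16), (3, 4), (10, 12), 0.1)
# During training, (p1, p2, tau) would be chosen to maximise an information-gain
# criterion over the offsets stored at the node.
```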

  • Regression Forests – Review

    In previous methods for multiple-target (part) regression, each part is regressed separately, ignoring the interdependency.

    Y = { y1, …, yi, …, yn } : the target (part) positions

    X : the set of all image patches; Xi : the set of image patches that are able to vote for point i

    Independence assumption in previous methods (a vote-aggregation sketch follows below):

    p( yi | X ) ∝ Σ_{x ∈ Xi} p( yi | x )

    p( yj | X ) ∝ Σ_{x ∈ Xj} p( yj | x )

    [Figure: some non-plausible results from the paper of Dantone et al. CVPR12]
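    A minimal sketch of this independent voting for a single landmark, assuming each leaf stores (offset, weight) pairs; all names are illustrative.

```python
import numpy as np

def aggregate_votes(patch_positions, leaf_votes, out_shape):
    """Independent voting for one landmark i: every patch x in Xi adds the
    weighted offset votes stored at the leaf it reached, p(yi|X) ~ sum_x p(yi|x)."""
    density = np.zeros(out_shape)
    for (y0, x0), votes in zip(patch_positions, leaf_votes):
        for (dy, dx), w in votes:                       # (offset, weight) pairs at the leaf
            cy, cx = int(y0 + dy), int(x0 + dx)
            if 0 <= cy < out_shape[0] and 0 <= cx < out_shape[1]:
                density[cy, cx] += w
    return density     # its argmax estimates landmark i, ignoring the other landmarks
```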

  • SORF

    [Figure: a face annotated with 20 landmark points and four image patches x16–x19, with panels:]

    a. A predefined graph model over the landmarks

    b. Local graph related to point 17

    c. Traditional independent voting: p(y16 | l), p(y17 | l), p(y18 | l), p(y19 | l)

    d. SO voting: p(yj | y17, l), j ∈ Ne(17); p(yj | y16, l), j ∈ Ne(16); p(yj | y19, l), j ∈ Ne(19); p(yj | y18, l), j ∈ Ne(18)

    Model learning: regression models for the base point and its neighbors

  • SORF

    Training

    Regression model for the base point (relative offset vectors and weights), aggregated with a mean-shift method as in [Sun et al. CVPR2012]:

    p( yi | l ) : { Δ_lik , ω_lik }

    Gaussian shape model between the base point and its neighbors:

    p( yi | yj, l ) = N( d_li − d_lj | Δ_jl^i , Λ_jl^i )

    Inference

    A feature patch xi from position yi^0 reaches leaf l; the absolute vote position for yi is ŷi = yi^0 + Δ_lik. To aggregate the votes:

    p( yi | xi ) = Σ_k ω_lik exp( − ‖ yi − ŷi ‖² / hi² )

    p( yj | yi ) = K( ( yj − (ŷi + Δ_il^j) ) / h_ij )

  • SORF

    p( yi | X ) ∝ [ Σ_{x ∈ Xi} p( yi | x ) ] · p( yi | {yj}, j ∈ Ne(i) )   (sketched below)

    [Figure: patches xi, xj inside the face bounding box X voting for landmarks yi, yj]
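    A rough sketch of how the independent vote map could be combined with the neighbour term, assuming a single Gaussian per neighbour with a shared bandwidth (a simplification of the learned shape model).

```python
import numpy as np

def structured_output_peak(unary, neighbour_estimates, neighbour_offsets, bandwidth):
    """Reweight the independent vote map of landmark i by a Gaussian term centred at
    each neighbour's predicted position plus its learned mean offset, then take the
    peak: a sketch of p(yi|X) = [sum_x p(yi|x)] * p(yi | {yj}, j in Ne(i))."""
    yy, xx = np.mgrid[0:unary.shape[0], 0:unary.shape[1]].astype(float)
    pairwise = np.ones(unary.shape)
    for (yj, xj), (dy, dx) in zip(neighbour_estimates, neighbour_offsets):
        mu_y, mu_x = yj + dy, xj + dx        # where neighbour j expects landmark i to be
        pairwise *= np.exp(-((yy - mu_y) ** 2 + (xx - mu_x) ** 2) / (2 * bandwidth ** 2))
    return np.unravel_index(np.argmax(unary * pairwise), unary.shape)
```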

  • SORF vs. RF on BioID dataset

  • Slide 22

    Privileged Information CRF

    Training patches (x, y*, y): image patch x, privileged information y* (e.g. head pose), and the i-th point offsets y

    Split function selection:  RF: φ* = argmax_φ IG_y(φ)   PI-RF: φ* = argmax_φ IG_{y*}(φ)

    Three models at each leaf node:

    1. p( y*_k | l ) = n_k / n

    2. p( yi | y*_k, l ) : { Δ_il^k , ω_il^k }

    3. p( yj − yi | y*_k, l ) = N( dj − di | μ_ij , Σ_ij )

    Privileged information is only available during training; it is used for:

    1. tree growing

    2. conditional model learning (similar to the CRF in [Sun et al. CVPR2012])

    [Figure: training patches, labelled by privileged class (1, 2), routed to the leaves of an RF vs. a PI-RF]

  • Slide 23

    PI-CRF: Inference

    PI is only available during training; at test time it is estimated first:

    p( y*_k | X ) ∝ Σ_{x ∈ X} Σ_{l ∈ Lx} p( y*_k | l )

    where Lx is the set of leaf nodes at which patch x arrived.

    ŷ*_k = argmax_{y*_k ∈ Y*_k} p( y*_k | X )

    Then ŷ*_k is used in the subsequent steps to select the regression and shape models

    p( yi | y*_k, l ) : { Δ_il^k , ω_il^k }

    p( yj − yi | y*_k, l ) = N( dj − di | μ_ij , Σ_ij )

    for which y*_k = ŷ*_k (see the sketch after this slide).
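    A sketch of this two-step inference, assuming each reached leaf is represented as a dictionary holding the privileged-state histogram and the conditional models (a hypothetical data layout, not the actual implementation).

```python
import numpy as np

def infer_privileged_then_select(leaves, num_states):
    """PI-CRF-style inference sketch: (1) estimate the privileged state y*
    (e.g. a head-pose bin) by accumulating the leaf distributions p(y*|l) over
    all test patches, (2) select the conditional models stored for that state."""
    hist = np.zeros(num_states)
    for leaf in leaves:                 # one entry per (patch, tree) leaf reached
        hist += leaf["p_pi"]            # p(y*_k | l), length num_states
    k_hat = int(np.argmax(hist))        # estimated privileged state
    models = [(leaf["offsets"][k_hat], leaf["shape"][k_hat]) for leaf in leaves]
    return k_hat, models
```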

  • Slide 24

    PI-CRF: Experiments

    Experiments with three types of privileged information are conducted on the LFW dataset.

    Head-pose yaw and roll show a promising improvement, while gender information does not.

    [LFW database: http://vis-www.cs.umass.edu/lfw/]

  • Slide 25

    PI-CRF: Experiments

    The first row shows the detection results of [Dantone et al. CVPR2012];

    the second row shows our detection results [Yang and Patras, IEEE FG 2013].

  • Slide 26

    PI-CRF: Experiments

    The advantages of our method:

    1. Shared tree structure, with no need to train an additional forest for head pose estimation

    2. Fusion of different types of useful privileged information

    3. Taking structure into account

    4. Better performance

    [Figure: overall performance on the LFW dataset]

  • References

    J. Shotton et al. Efficient regression of general-activity human poses from depth images. ICCV 2011.

    J. Shotton et al. Real-time human pose recognition in parts from single depth images. CVPR 2011.

    A. Criminisi et al. Regression forests for efficient anatomy detection and localization in CT studies. Medical Computer Vision: Recognition Techniques and Applications in Medical Imaging, 2011.

    M. Sun et al. Conditional regression forests for human pose estimation. CVPR 2012.

    M. Dantone et al. Real-time facial feature detection using conditional regression forests. CVPR 2012.

    H. Yang, I. Patras. Face parts localization using structured-output regression forests. ACCV 2012.

    H. Yang, I. Patras. Privileged information-based conditional regression forests for facial feature detection. IEEE FG 2013.

  • Pose-invariant Facial Expression Recognition

    Anger, Surprise, Sadness, Disgust, Fear, Happiness

    Rudovic, Patras, Pantic, ECCV 2010

    Rudovic, Patras, Pantic, TPAMI (to appear)

  • Pose-invariant FER: Our Approach

    [Pipeline: head pose estimation → pose normalization → facial expression classification; output emotion: SURPRISE]

  • Pose-invariant FER: Pose Normalization

    [Pipeline: head pose estimation → pose normalization → facial expression classification; output emotion: SURPRISE]

    (A pipeline sketch follows below.)
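    A sketch of the pipeline, where estimate_pose, normalisers and classify are hypothetical components; the actual system performs the normalisation step with Coupled Gaussian Process Regression (CGPR), which is not reproduced here.

```python
import numpy as np

def pose_invariant_fer(landmarks, estimate_pose, normalisers, classify):
    """Pipeline sketch: head-pose estimation -> pose normalisation -> expression
    classification. `landmarks` is a (39, 2) array of facial landmark positions;
    `normalisers[pose]` maps landmarks observed at `pose` to the frontal view."""
    pose = estimate_pose(landmarks)                     # e.g. a discrete (pan, tilt) bin
    frontal = normalisers[pose](landmarks)              # regression-based pose normalisation
    return classify(np.asarray(frontal).reshape(-1))    # e.g. "Surprise"
```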

  • Pose-invariant FER: Experiments

    Experiments conducted on the BU-3DFE and Multi-PIE databases.

    Input to the system: the positions of 39 facial landmarks.

    Overview of the conducted experiments:

    1. BU-3DFE: evaluation of the CGPR model in terms of (a) head-pose-normalization accuracy, (b) robustness to noise, and (c) facial expression classification (balanced dataset).

    2. Multi-PIE: evaluation of the CGPR model in terms of (a) head-pose-normalization accuracy and (b) facial expression classification (unbalanced dataset).

  • Pose-invariant FER: Experiments

    - 7 facial expressions: Surprise, Anger, Happiness, Neutral, Disgust, Fear, Sadness

    - 247 poses (35 used for training), spanning (pan, tilt) from (−45°, −30°) to (+45°, +30°)

    - 50 subjects (54% female)

    - 5-fold person-independent cross validation

    [Plot legend: tp (training poses), ntp (non-training poses)]

  • Pose-invariant FER: Experiments

  • Pose-invariant FER: Experiments

  • Implicit tagging via EEG and Face Analysis

    Recognition of Affective States: Arousal, Valence, Control

  • Implicit tagging via EEG and Face Analysis

  • Implicit tagging via EEG and Face Analysis

  • Implicit tagging via EEG and Face Analysis

  • Possible collaborations

    • Facial (Expression) Analysis

    • Body gesture analysis (pose estimation, tracking, action recognition)

    • Multimodal analysis for affect recognition