
  • Looking at / Sensing people

    Ioannis Patras

    www.eecs.qmul.ac.uk/~ioannisp

    Centre for Intelligent Sensing

    Queen Mary University of London

  • Related research

    • Scene analysis

    Object recognition / Semantic segmentation

    • Motion Analysis

    Motion estimation / segmentation

    Object Tracking

    • Facial (Expression) Analysis

    Head tracking/Facial Feature Tracking

    Facial expression recognition

    • Action / Gesture Recognition

    Spatio-temporal representations for action recognition

    Pose estimation

    • Brain Computer Interfaces

    Dynamic Vision

    Looking at / sensing people

    Static Analysis

  • Looking at/sensing people

    • Facial (Expression) Analysis

    Head tracking/Facial Feature Tracking

    Facial expression recognition

    • Action / Gesture Recognition

    Action recognition and localisation

    Pose estimation

    • Brain Computer Interfaces

  • Introduction

    Motivation

    Vision-based analysis and understanding of human activities is becoming of paramount importance in a world centered on humans and overwhelmed with visual data.

    Challenges

    Detection, tracking, understanding

    Applications

    Visual Surveillance, Human Machine/Robot Interaction, Intelligent Systems, Multimedia Analysis, Ambient Intelligence

    Related expertise

    Computer Vision, Pattern Recognition, AI

  • Recognition and Localisation of Actions

    Goal:

    Recognize categories of actions

    Localize them in terms of their bounding box (space + time)

    Challenges:

    Occlusions, clutter, variations

    Hypothesis: Analysis can be restricted to a set of spatiotemporally ‘interesting’/salient events (sketched below)
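    The salient-event hypothesis can be illustrated with a minimal sketch. The response function below (temporal-gradient energy) and the fixed event budget are assumptions for illustration, not the detector used in this work.

```python
import numpy as np

def salient_events(video, num_events=200):
    """Keep the (t, y, x) locations with the largest temporal-gradient energy.
    `video` is a (T, H, W) float array of grey-level frames."""
    energy = np.abs(np.diff(video, axis=0))                 # crude measure of local change
    top = np.argsort(energy, axis=None)[::-1][:num_events]  # strongest responses first
    return list(zip(*np.unravel_index(top, energy.shape)))  # salient (t, y, x) events

# usage: events = salient_events(np.random.rand(50, 120, 160))
```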

  • Implicit Shape Model (ISM)

    Input training patches → clustering → codewords

    Each codeword is associated with a vote map (a set of offsets from the codeword centre) that gives the possible locations of the hypothesis centre (see the voting sketch after this slide)

    [Figure: training patches D1, D2, D3 clustered into codewords in appearance space, each storing its offsets to the hypothesis centre]
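    A minimal sketch of the codeword voting described above, assuming nearest-centroid codeword assignment and unweighted, normalised votes (the slides use a probabilistic Hough score; the names and choices here are illustrative).

```python
import numpy as np

def hough_vote(features, positions, codebook, offsets, out_shape):
    """ISM-style voting: each local feature is matched to its nearest codeword,
    and the codeword's stored offsets cast votes for the hypothesis centre.
    codebook: (K, D) cluster centres; offsets[k]: list of (dy, dx) training offsets."""
    hough = np.zeros(out_shape)
    for f, (y, x) in zip(features, positions):
        k = int(np.argmin(np.linalg.norm(codebook - f, axis=1)))   # nearest codeword
        for dy, dx in offsets[k]:                                   # its vote map
            cy, cx = int(y + dy), int(x + dx)
            if 0 <= cy < out_shape[0] and 0 <= cx < out_shape[1]:
                hough[cy, cx] += 1.0 / len(offsets[k])              # normalised vote
    return hough                                                    # peaks = hypothesis centres
```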

  • Implicit Shape Model (ISM)

    [Figure: codewords S1 … SN in appearance space; a local feature at xi is matched to a codeword and the codeword’s voting map is cast into the output Hough space]

  • Implicit Shape Model (ISM)

    [Figure: votes from several local features accumulated in the voting space of the output Hough space]

  • Implicit Shape Model (ISM)

    [Figure: the hypothesis centre emerges as a peak in the output Hough space]

  • Discriminative learning

    • Higher weights for pdfs with higher localisation accuracy

    • Class dictionary comprises discriminative codewords

    • Adaboost on the codeword similarities, with codeword weights of the form

    w_d ∝ exp( | log p(c | d) − log p(c̄ | d) | )

  • Discriminative Voting Score

    Yc : an area around the hypothesis centre of the training image

    Let S(y) denote the probabilistic Hough score at location y

    The discriminative voting score (sketched below):

    Σ_{y ∈ Yc} S(y) − Σ_{y ∉ Yc} S(y)

    Objective: maximize the discriminative voting score over the training set

    [Figure: output Hough space with the region Yc around the true centre and the local features voting into it]
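    A short sketch of this score, assuming Yc is a circular region of a given radius around the ground-truth centre (the shape of Yc is an assumption made only for this example).

```python
import numpy as np

def discriminative_score(hough, centre, radius):
    """Sum of Hough scores S(y) inside the region Yc around the true centre
    minus the sum outside it, i.e. the quantity maximised over the training set."""
    yy, xx = np.mgrid[0:hough.shape[0], 0:hough.shape[1]]
    inside = (yy - centre[0]) ** 2 + (xx - centre[1]) ** 2 <= radius ** 2
    return hough[inside].sum() - hough[~inside].sum()
```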

  • Goal: Learn a task-dependent dictionary for the localization of actions

    ISM with local features { xi, li }, i = 1, …, N

    S(y): Hough score at location y; yc: hypothesis centre; Yc: area around yc

    Objective: Σ_{yi ∈ Yc} S(yi) − Σ_{yi ∉ Yc} S(yi)

  • Action recognition

    • KTH dataset – average: 88%

    • HoHA dataset – average: 37%

  • Artificial occlusions and clutter

  • Detection Results


  • Regression Forests for Facial Analysis

    [H. Yang, I. Patras, ACCV 2012]

    [H. Yang, I. Patras, IEEE FG 2013]

    [Figure: a regression tree routing the input training data and an input test point through the split function at each node (a node-test sketch follows below)]
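    The slides do not specify the node test; a common choice in such face forests (e.g. two-pixel comparison tests as in Dantone et al. CVPR12) is sketched below as an assumption.

```python
import numpy as np

def split_function(patch, p1, p2, tau):
    """Binary node test: route a 2-D image patch left/right by comparing the
    intensities at two positions p1, p2 against a threshold tau."""
    return patch[p1] - patch[p2] < tau          # True -> left child, False -> right child

# usage: split_function(np.random.rand(16, 16), (3, 4), (10, 12), 0.1)
# During training, (p1, p2, tau) would be chosen to maximise an information-gain
# criterion over the offsets stored at the node.
```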

  • Regression Forests – Review

    In previous methods for multiple-target (part) regression, each part is regressed separately, ignoring the interdependency.

    Y = { y1, …, yi, …, yn } : the target (part) positions

    X : the set of all image patches; Xi : the set of image patches that are able to vote for point i

    Independence assumption in previous methods (a vote-aggregation sketch follows below):

    p( yi | X ) ∝ Σ_{x ∈ Xi} p( yi | x )

    p( yj | X ) ∝ Σ_{x ∈ Xj} p( yj | x )

    [Figure: some non-plausible results from the paper of Dantone et al. CVPR12]
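    A minimal sketch of this independent voting for a single landmark, assuming each leaf stores (offset, weight) pairs; all names are illustrative.

```python
import numpy as np

def aggregate_votes(patch_positions, leaf_votes, out_shape):
    """Independent voting for one landmark i: every patch x in Xi adds the
    weighted offset votes stored at the leaf it reached, p(yi|X) ~ sum_x p(yi|x)."""
    density = np.zeros(out_shape)
    for (y0, x0), votes in zip(patch_positions, leaf_votes):
        for (dy, dx), w in votes:                       # (offset, weight) pairs at the leaf
            cy, cx = int(y0 + dy), int(x0 + dx)
            if 0 <= cy < out_shape[0] and 0 <= cx < out_shape[1]:
                density[cy, cx] += w
    return density     # its argmax estimates landmark i, ignoring the other landmarks
```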

  • SORF

    [Figure: a face annotated with 20 landmark points and four image patches x16–x19, with panels:]

    a. A predefined graph model over the landmarks

    b. Local graph related to point 17

    c. Traditional independent voting: p(y16 | l), p(y17 | l), p(y18 | l), p(y19 | l)

    d. SO voting: p(yj | y17, l), j ∈ Ne(17); p(yj | y16, l), j ∈ Ne(16); p(yj | y19, l), j ∈ Ne(19); p(yj | y18, l), j ∈ Ne(18)

    Model learning: regression models for the base point and its neighbors

  • SORF

    Training

    Regression model for the base point (relative offset vectors and weights), aggregated with a mean-shift method as in [Sun et al. CVPR2012]:

    p( yi | l ) : { Δ_lik , ω_lik }

    Gaussian shape model between the base point and its neighbors:

    p( yi | yj, l ) = N( d_li − d_lj | Δ_jl^i , Λ_jl^i )

    Inference

    A feature patch xi from position yi^0 reaches leaf l; the absolute vote position for yi is ŷi = yi^0 + Δ_lik. To aggregate the votes:

    p( yi | xi ) = Σ_k ω_lik exp( − ‖ yi − ŷi ‖² / hi² )

    p( yj | yi ) = K( ( yj − (ŷi + Δ_il^j) ) / h_ij )

  • SORF

    p( yi | X ) ∝ [ Σ_{x ∈ Xi} p( yi | x ) ] · p( yi | {yj}, j ∈ Ne(i) )   (sketched below)

    [Figure: patches xi, xj inside the face bounding box X voting for landmarks yi, yj]
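    A rough sketch of how the independent vote map could be combined with the neighbour term, assuming a single Gaussian per neighbour with a shared bandwidth (a simplification of the learned shape model).

```python
import numpy as np

def structured_output_peak(unary, neighbour_estimates, neighbour_offsets, bandwidth):
    """Reweight the independent vote map of landmark i by a Gaussian term centred at
    each neighbour's predicted position plus its learned mean offset, then take the
    peak: a sketch of p(yi|X) = [sum_x p(yi|x)] * p(yi | {yj}, j in Ne(i))."""
    yy, xx = np.mgrid[0:unary.shape[0], 0:unary.shape[1]].astype(float)
    pairwise = np.ones(unary.shape)
    for (yj, xj), (dy, dx) in zip(neighbour_estimates, neighbour_offsets):
        mu_y, mu_x = yj + dy, xj + dx        # where neighbour j expects landmark i to be
        pairwise *= np.exp(-((yy - mu_y) ** 2 + (xx - mu_x) ** 2) / (2 * bandwidth ** 2))
    return np.unravel_index(np.argmax(unary * pairwise), unary.shape)
```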

  • SORF vs. RF on BioID dataset

  • Slide 22

    Privileged Information CRF

    Training patches (x, y*, y): image patch x, privileged information y* (e.g. head pose), and the i-th point offsets y

    Split function selection:  RF: φ* = argmax_φ IG_y(φ)   PI-RF: φ* = argmax_φ IG_{y*}(φ)

    Three models at each leaf node:

    1. p( y*_k | l ) = n_k / n

    2. p( yi | y*_k, l ) : { Δ_il^k , ω_il^k }

    3. p( yj − yi | y*_k, l ) = N( dj − di | μ_ij , Σ_ij )

    Privileged information is only available during training; it is used for:

    1. tree growing

    2. conditional model learning (similar to the CRF in [Sun et al. CVPR2012])

    [Figure: training patches, labelled by privileged class (1, 2), routed to the leaves of an RF vs. a PI-RF]

  • Slide 23

    PI-CRF: Inference

    PI is only available during training; at test time it is estimated first:

    p( y*_k | X ) ∝ Σ_{x ∈ X} Σ_{l ∈ Lx} p( y*_k | l )

    where Lx is the set of leaf nodes at which patch x arrived.

    ŷ*_k = argmax_{y*_k ∈ Y*_k} p( y*_k | X )

    Then ŷ*_k is used in the subsequent steps to select the regression and shape models

    p( yi | y*_k, l ) : { Δ_il^k , ω_il^k }

    p( yj − yi | y*_k, l ) = N( dj − di | μ_ij , Σ_ij )

    for which y*_k = ŷ*_k (see the sketch after this slide).
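    A sketch of this two-step inference, assuming each reached leaf is represented as a dictionary holding the privileged-state histogram and the conditional models (a hypothetical data layout, not the actual implementation).

```python
import numpy as np

def infer_privileged_then_select(leaves, num_states):
    """PI-CRF-style inference sketch: (1) estimate the privileged state y*
    (e.g. a head-pose bin) by accumulating the leaf distributions p(y*|l) over
    all test patches, (2) select the conditional models stored for that state."""
    hist = np.zeros(num_states)
    for leaf in leaves:                 # one entry per (patch, tree) leaf reached
        hist += leaf["p_pi"]            # p(y*_k | l), length num_states
    k_hat = int(np.argmax(hist))        # estimated privileged state
    models = [(leaf["offsets"][k_hat], leaf["shape"][k_hat]) for leaf in leaves]
    return k_hat, models
```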

  • Slide 24

    PI-CRF: Experiments

    Experiments with three types of privileged information are conducted on the LFW dataset.

    Head-pose yaw and roll show a promising improvement, while gender information does not.

    [LFW database: http://vis-www.cs.umass.edu/lfw/]

  • Slide 25

    PI-CRF: Experiments

    The first row shows the detection results of [Dantone et al. CVPR2012];

    the second row shows our detection results [Yang and Patras, IEEE FG 2013].

  • Slide 26

    PI-CRF: Experiments

    The advantages of our method:

    1. Shared tree structure, with no need to train an additional forest for head pose estimation

    2. Fusion of different types of useful privileged information

    3. Taking structure into account

    4. Better performance

    [Figure: overall performance on the LFW dataset]

  • References

    J. Shotton et al. Efficient regression of general-activity human poses from depth images. ICCV 2011.

    J. Shotton et al. Real-time human pose recognition in parts from single depth images. CVPR 2011.

    A. Criminisi et al. Regression forests for efficient anatomy detection and localization in CT studies. Medical Computer Vision: Recognition Techniques and Applications in Medical Imaging, 2011.

    M. Sun et al. Conditional regression forests for human pose estimation. CVPR 2012.

    M. Dantone et al. Real-time facial feature detection using conditional regression forests. CVPR 2012.

    H. Yang, I. Patras. Face parts localization using structured-output regression forests. ACCV 2012.

    H. Yang, I. Patras. Privileged information-based conditional regression forests for facial feature detection. IEEE FG 2013.

  • Pose-invariant Facial Expression Recognition

    Anger, Surprise, Sadness, Disgust, Fear, Happiness

    Rudovic, Patras, Pantic, ECCV 2010

    Rudovic, Patras, Pantic, TPAMI (to appear)

  • Pose-invariant FER: Our Approach

    [Pipeline: head pose estimation → pose normalization → facial expression classification; output emotion: SURPRISE]

  • Pose-invariant FER: Pose Normalization

    [Pipeline: head pose estimation → pose normalization → facial expression classification; output emotion: SURPRISE]

    (A pipeline sketch follows below.)
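    A sketch of the pipeline, where estimate_pose, normalisers and classify are hypothetical components; the actual system performs the normalisation step with Coupled Gaussian Process Regression (CGPR), which is not reproduced here.

```python
import numpy as np

def pose_invariant_fer(landmarks, estimate_pose, normalisers, classify):
    """Pipeline sketch: head-pose estimation -> pose normalisation -> expression
    classification. `landmarks` is a (39, 2) array of facial landmark positions;
    `normalisers[pose]` maps landmarks observed at `pose` to the frontal view."""
    pose = estimate_pose(landmarks)                     # e.g. a discrete (pan, tilt) bin
    frontal = normalisers[pose](landmarks)              # regression-based pose normalisation
    return classify(np.asarray(frontal).reshape(-1))    # e.g. "Surprise"
```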

  • Pose-invariant FER: Experiments

    Experiments conducted on the BU-3DFE and Multi-PIE databases.

    Input to the system: the positions of 39 facial landmarks.

    Overview of the conducted experiments:

    1. BU-3DFE: evaluation of the CGPR model in terms of (a) head-pose-normalization accuracy, (b) robustness to noise, and (c) facial expression classification (balanced dataset).

    2. Multi-PIE: evaluation of the CGPR model in terms of (a) head-pose-normalization accuracy and (b) facial expression classification (unbalanced dataset).

  • Pose-invariant FER: Experiments

    - 7 facial expressions: Surprise, Anger, Happiness, Neutral, Disgust, Fear, Sadness

    - 247 poses (35 used for training), spanning (pan, tilt) from (−45°, −30°) to (+45°, +30°)

    - 50 subjects (54% female)

    - 5-fold person-independent cross validation

    [Plot legend: tp (training poses), ntp (non-training poses)]

  • Pose-invariant FER: Experiments

  • Pose-invariant FER: Experiments

  • Implicit tagging via EEG and Face Analysis

    Recognition of Affective States: Arousal, Valence, Control

  • Implicit tagging via EEG and Face Analysis

  • Implicit tagging via EEG and Face Analysis

  • Implicit tagging via EEG and Face Analysis

  • Possible collaborations

    • Facial (Expression) Analysis

    • Body gesture analysis (pose estimation, tracking, action recognition)

    • Multimodal analysis for affect recognition