Low-Level Fusion of Audio and Video Feature for Multi-modal Emotion Recognition


Chair for Image Understanding and Knowledge-based Systems

Institute for Informatics

Technische Universität München

Sylvia Pietzsch


January 23rd, 2008

Overview

- Video low-level descriptors
  - Model-based image interpretation
  - Structural features
  - Temporal features
- Audio low-level descriptors
- Combining video and audio descriptors
- Experimental results
- Conclusion and outlook


Model-based Image Interpretation

- The model: contains a parameter vector that represents the model's configuration.
- The objective function: calculates a value that indicates how accurately a parameterized model matches an image.
- The fitting algorithm: searches for the model parameters that describe the image best, i.e. it minimizes the objective function.
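
The interplay of the three components can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the slides do not say which fitting algorithm is used, so a simple greedy coordinate search stands in for it, and `toy_objective` is a hypothetical stand-in for a real image-based objective function.

```python
from typing import Callable, List, Sequence


def fit_model(objective: Callable[[Sequence[float]], float],
              start: Sequence[float],
              step: float = 1.0,
              min_step: float = 1e-3) -> List[float]:
    """Hypothetical fitting routine: greedy coordinate search.

    Tries to move each parameter up or down by `step`, keeps any move
    that lowers the objective, and halves `step` once no move helps.
    """
    params = list(start)
    best = objective(params)
    while step >= min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                candidate = params[:]
                candidate[i] += delta
                value = objective(candidate)
                if value < best:
                    params, best = candidate, value
                    improved = True
        if not improved:
            step /= 2.0
    return params


def toy_objective(p: Sequence[float]) -> float:
    # Toy stand-in for "how badly does the parameterized model match the
    # image"; its single minimum lies at the parameters (3.0, -1.5).
    return (p[0] - 3.0) ** 2 + (p[1] + 1.5) ** 2


fitted = fit_model(toy_objective, start=[0.0, 0.0])
```

A local search like this only finds the best fit reliably when the objective function is well behaved, which is what motivates the properties discussed on the following slides.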


Local Objective Functions


Ideal Objective Functions

- P1, correctness property: the global minimum corresponds to the best fit.
- P2, uni-modality property: the objective function has no local extrema.

[Figure: example objective functions arranged in a grid by ¬P1 / P1 and ¬P2 / P2]

- Don't exist for real-world images
- Only for annotated images: f_n(I, x) = |c_n - x|
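
The formula for annotated images can be written down directly. This sketch assumes that c_n is the manually annotated position of contour point n, that x is a candidate pixel position, and that |.| is the Euclidean distance; the slide itself does not spell out these meanings. The function name is hypothetical.

```python
import math


def ideal_objective(annotated_point, candidate):
    """f_n(I, x) = |c_n - x|, assuming Euclidean distance in pixels.

    The image I does not appear in the computation: the annotation
    alone determines the value, which is why this only works for
    annotated images.
    """
    cx, cy = annotated_point
    x, y = candidate
    return math.hypot(cx - x, cy - y)


# Evaluate on a small grid of candidate positions around the annotation.
c_n = (10, 20)
grid = [(x, y) for x in range(5, 16) for y in range(15, 26)]
best_candidate = min(grid, key=lambda p: ideal_objective(c_n, p))
```

Under these assumptions the function satisfies both properties by construction: its only minimum is at the annotated position, and the value grows steadily with distance from it.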


Learning the Objective Function

- The ideal objective function generates the training data.
- A machine learning technique generates the calculation rules.
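
The two steps can be sketched on a toy problem. All specifics below are assumptions for illustration only: the 1-D "image" with a smooth edge, the three intensity samples used as features, and the k-nearest-neighbour regressor are stand-ins, since the slides do not name the image features or the learning technique actually used.

```python
import math
import random


def intensity(x, edge):
    """Toy 1-D image row: a smooth dark-to-bright edge located at `edge`."""
    return 1.0 / (1.0 + math.exp(-(x - edge) / 3.0))


def features(x, edge):
    """Hypothetical image features: intensities sampled around position x.

    `edge` is only needed to render the toy image, not as a label.
    """
    return (intensity(x - 2, edge), intensity(x, edge), intensity(x + 2, edge))


def make_training_data(annotated_edges, rng, samples_per_image=200):
    """Step 1: the ideal objective |c - x| labels randomly displaced positions."""
    data = []
    for c in annotated_edges:
        for _ in range(samples_per_image):
            x = c + rng.uniform(-15.0, 15.0)
            data.append((features(x, c), abs(c - x)))
    return data


def learned_objective(feat, data, k=5):
    """Step 2: a k-nearest-neighbour regressor predicts the value from features."""
    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(other, feat))
    nearest = sorted(data, key=lambda d: sq_dist(d[0]))[:k]
    return sum(target for _, target in nearest) / k


rng = random.Random(0)
training = make_training_data([20.0, 35.0, 50.0], rng)

# Apply the learned function to an image without annotation.
unseen_edge = 42.0
best_x = min(range(27, 58),
             key=lambda x: learned_objective(features(x, unseen_edge), training))
```

The learned function only sees image features, so it can be evaluated on images that have no annotation, while inheriting its shape from the ideal function that labelled the training samples.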


Skin Color Extraction

- Location of contour lines and skin-colored parts
- Adaptive to image context conditions

[Figure: example images shown as original image, fixed classifier output, and adapted classifier output]

Correctly detected pixels (three examples):

fixed classifier:     90.4%   74.8%   40.2%
adapted classifier:   97.5%   87.5%   97.0%


Structural Features

- Deformation parameters describe a distinctive state of the face.


Temporal Features

- Facial expressions emerge from muscle activity.
- Optical flow vectors are calculated at equally distributed feature points connected to the shape model.


Audio Low-level Descriptors

- Aiming at independence of phonetic content and speaker
- Coverage of prosodic, articulatory, and voice quality aspects
- 20 ms frames, 50% overlap, Hamming window function

- Zero crossing rate (ZCR)
- Pitch
- 7 formants
- Energy
- Spectral development
- Harmonics-to-Noise Ratio (HNR)
- Durations of voiced sounds by HNR
- Durations of silences by bi-state energy

- SMA filtering of LLDs
- Addition of 1st and 2nd order LLD regression coefficients
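
The framing and post-processing chain can be sketched for two of the simpler descriptors. This is a minimal illustration, not the extraction code behind the slides: it assumes that SMA means a simple moving average, and the smoothing width, the regression window, the 16 kHz sample rate and the test tone are all illustrative choices. Pitch, formants and HNR are not covered.

```python
import math


def frames(signal, sample_rate, frame_ms=20, overlap=0.5):
    """Cut the signal into 20 ms frames with 50% overlap."""
    size = int(sample_rate * frame_ms / 1000)
    hop = int(size * (1 - overlap))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def zero_crossing_rate(frame):
    """Fraction of neighbouring sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def log_energy(frame, window):
    """Log energy of the windowed frame (small offset avoids log(0))."""
    return math.log(sum((s * w) ** 2 for s, w in zip(frame, window)) + 1e-12)


def sma(series, width=3):
    """Simple moving average; the window shrinks at the series edges."""
    half = width // 2
    out = []
    for i in range(len(series)):
        segment = series[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out


def delta(series, w=2):
    """Regression coefficients over +/- w frames (edges padded by repetition)."""
    n = len(series)
    denom = 2 * sum(k * k for k in range(1, w + 1))

    def at(i):
        return series[min(max(i, 0), n - 1)]

    return [sum(k * (at(i + k) - at(i - k)) for k in range(1, w + 1)) / denom
            for i in range(n)]


# Illustrative input: one second of a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]
frame_list = frames(tone, sample_rate)
window = hamming(len(frame_list[0]))

zcr = [zero_crossing_rate(f) for f in frame_list]
energy = sma([log_energy(f, window) for f in frame_list])
d_energy = delta(energy)      # 1st order regression coefficients
dd_energy = delta(d_energy)   # 2nd order regression coefficients
```

Applying `delta` twice yields the second order coefficients, so each descriptor contributes three time series: the smoothed contour and its first and second order regression coefficients.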


Combining Audio and Video LLDs

- Time series constructed for LLDs (audio and video separately)
- Application of functionals to the combined low-level descriptors:
  - Linear moments (mean, std. deviation)
  - Quartiles
  - Durations
- Resulting feature vector: 276 audio features, 1048 video features
- Classifier: SVM
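
The fusion step can be sketched as follows: the audio and video time series are pooled first, and the same functionals are then applied to every series, giving one fixed-length vector per recording. The descriptor names and values are invented for illustration, the duration functionals are left out, and the quartile definition is one of several in common use.

```python
import statistics


def functionals(series):
    """Map a variable-length time series to five static features."""
    q1, q2, q3 = statistics.quantiles(series, n=4, method="inclusive")
    return [statistics.mean(series), statistics.pstdev(series), q1, q2, q3]


def fuse(audio_llds, video_llds):
    """Low-level fusion: pool the LLD series, then apply the functionals.

    Both arguments map a (hypothetical) descriptor name to its time series.
    """
    combined = {"audio:" + name: s for name, s in audio_llds.items()}
    combined.update({"video:" + name: s for name, s in video_llds.items()})
    vector = []
    for name in sorted(combined):  # fixed order keeps vectors comparable
        vector.extend(functionals(combined[name]))
    return vector


# Invented example values, not data from the experiments.
feature_vector = fuse(
    {"energy": [1, 2, 3, 4, 5], "zcr": [0.1, 0.2, 0.1, 0.2, 0.1]},
    {"flow_x": [0, 1, 0, -1, 0]},
)
```

Because the functionals remove the time axis, audio and video series with different frame rates or lengths still end up in a single vector that a static classifier such as an SVM can consume.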


Experimental Results (1)

- Database: Airplane Behavior Corpus
  - Guided storyline
  - 8 subjects (25 to 48 years old)
  - 11.5 hours of video in total
- 10-fold stratified cross-validation
- Feature pre-selection by SVM-SFFS (sequential forward floating search)

                Audio   Video   Audiovisual
Features [#]      92     156        200
Accuracy [%]    73.7    61.1       81.8
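
Stratification means that every fold keeps roughly the class proportions of the full data set. A minimal sketch of such a split is given below; the class counts are invented for illustration and are not the corpus statistics, and the function name is hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Deal the sample indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0  # keeps counting across classes so fold sizes stay balanced
    for label in sorted(by_class):
        for index in by_class[label]:
            folds[position % k].append(index)
            position += 1
    return folds


# Invented class counts, for illustration only.
labels = ["aggressive"] * 30 + ["neutral"] * 50 + ["nervous"] * 20
folds = stratified_folds(labels)

# Each fold serves once as the test set; the other nine form the training set.
test_indices = folds[0]
train_indices = [i for fold in folds[1:] for i in fold]
```

In practice the samples would usually be shuffled within each class before dealing them out; that step is omitted here to keep the result deterministic.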


Experimental Results (2)

- Main confusions: neutral with nervous, cheerful with intoxicated
- Aggressive behavior is recognized best


Conclusion and Outlook

- The combined feature set is superior to the individual audio or video feature sets.
- Future work:
  - Investigation on further data sets
  - Comparison to late fusion approaches
  - Performance of asynchronous feature fusion
  - Application of hierarchical functionals


Thank you!
