Low-Level Fusion of Audio and Video Feature for Multi-modal Emotion Recognition
Chair for Image Understanding and Knowledge-based Systems
Institute for Informatics
Technische Universität München
Sylvia Pietzsch
2008, January 23rd
Overview
Video low-level descriptors
  Model-based image interpretation
  Structural features
  Temporal features
Audio low-level descriptors
Combining video and audio descriptors
Experimental results
Conclusion and outlook
Model-based Image Interpretation
The model: contains a parameter vector that represents the model’s configuration.
The objective function: calculates a value that indicates how accurately a parameterized model matches an image.
The fitting algorithm: searches for the model parameters that describe the image best, i.e. it minimizes the objective function.
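The three components can be illustrated with a minimal sketch under toy assumptions: the model is a circle with parameter vector (cx, cy, r), the "image" is a list of observed contour points, and the fitting algorithm is a simple coordinate-wise local search. None of these choices come from the slides; the names model_points, objective, and fit are hypothetical.

```python
import math

def model_points(params, n=16):
    """Contour points of a toy circle model with parameter vector (cx, cy, r)."""
    cx, cy, r = params
    return [(cx + r * math.cos(2 * math.pi * k / n),
             cy + r * math.sin(2 * math.pi * k / n)) for k in range(n)]

def objective(image_points, params):
    """Mean squared distance between model points and observed points (lower is better)."""
    pts = model_points(params, len(image_points))
    return sum((px - qx) ** 2 + (py - qy) ** 2
               for (px, py), (qx, qy) in zip(pts, image_points)) / len(pts)

def fit(image_points, start, step=1.0, min_step=1e-4):
    """Coordinate-wise local search: try +/- step per parameter, halve the step when stuck."""
    params = list(start)
    best = objective(image_points, params)
    while step > min_step:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                value = objective(image_points, trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2
    return params, best

# Toy "image": contour points generated from known parameters (assumed for illustration).
observed = model_points((5.0, -2.0, 3.0))
fitted, residual = fit(observed, start=(0.0, 0.0, 1.0))
```

Starting from (0, 0, 1), the search should end close to (5, -2, 3) with a residual near zero. This toy objective is smooth and has a single minimum, which is why such a naive search suffices here.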
Local Objective Functions
Ideal Objective Functions
P1: Correctness property: The global minimum corresponds to the best fit.
P2: Uni-modality property: The objective function has no local extrema other than the global minimum.
[Figure: example objective functions for the combinations of ¬P1 / P1 and ¬P2 / P2]
Ideal objective functions don’t exist for real-world images.
Only for annotated images: f_n(I, x) = |c_n - x|
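The two properties can be checked numerically on sampled values. The sketch below assumes that c_n is the annotated position of contour point n and that |c_n - x| is the Euclidean distance; the helper names and the oscillating comparison function are hypothetical.

```python
import math

def ideal_objective(annotated, x):
    """f_n(I, x) = |c_n - x|: Euclidean distance from x to the annotated point c_n."""
    return math.dist(annotated, x)

def has_correctness(values, best_index):
    """P1: the global minimum of the sampled values lies at the best-fit position."""
    return min(range(len(values)), key=values.__getitem__) == best_index

def is_unimodal(values):
    """P2: values fall strictly to one minimum, then rise strictly (no other extrema)."""
    i = 0
    while i + 1 < len(values) and values[i + 1] < values[i]:
        i += 1
    while i + 1 < len(values) and values[i + 1] > values[i]:
        i += 1
    return i == len(values) - 1

c_n = (10.0, 4.0)                                   # assumed annotation
line = [(float(x), 4.0) for x in range(21)]         # positions sampled along one row
ideal = [ideal_objective(c_n, p) for p in line]
oscillating = [abs(math.sin(x / 2.0)) for x in range(21)]  # hypothetical non-ideal objective

ideal_ok = has_correctness(ideal, 10) and is_unimodal(ideal)
oscillating_ok = is_unimodal(oscillating)
```

Along this row the distance-based function satisfies both properties, while the oscillating function fails P2.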
Learning the Objective Function
The ideal objective function generates training data.
A machine learning technique generates calculation rules.
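The two steps can be sketched on a toy problem. Everything here is assumed for illustration: the 1-D step-edge "image", the window features, and the 1-nearest-neighbour regressor, which stands in for whatever learner was actually used (the slide does not name one).

```python
def features(image, x, radii=(2, 4, 8)):
    """Hypothetical local features: right-minus-left mean intensity at several window sizes."""
    return tuple(sum(image[x:x + w]) / w - sum(image[x - w:x]) / w for w in radii)

def ideal_objective(c, x):
    """Ideal objective for an annotated image: distance to the annotated position c."""
    return abs(c - x)

def make_image(c, length=40):
    """Toy 1-D 'image': a step edge at the annotated position c."""
    return [0.0] * c + [1.0] * (length - c)

# 1. The ideal objective function generates training data around each annotation.
training = []
for c in (12, 16, 20, 24):
    image = make_image(c)
    for d in range(-4, 5):
        training.append((features(image, c + d), ideal_objective(c, c + d)))

# 2. A learner turns the data into a calculation rule (here: 1-nearest-neighbour regression).
def learned_objective(image, x):
    f = features(image, x)
    nearest = min(training, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], f)))
    return nearest[1]

# 3. The learned rule needs no annotation, so it can score positions in an unseen image.
unseen = make_image(27)
scores = {x: learned_objective(unseen, x) for x in range(23, 32)}
best_x = min(scores, key=scores.get)
```

The learned function only looks at image features, so it can be evaluated on images without annotations. In this toy case its minimum falls on the true edge position of the unseen image.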
Skin Color Extraction
Location of contour lines and skin colored parts
Adaptive to image context conditions
[Figure: original image, fixed classifier, adapted classifier]
Correctly detected pixels:
  fixed classifier:    90.4%  74.8%  40.2%
  adapted classifier:  97.5%  87.5%  97.0%
Structural Features
Deformation parameters describe a distinctive state of the face.
Temporal Features
Facial expressions emerge from muscle activity.
Optical flow vectors are calculated at equally distributed feature points connected to the shape model.
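Computing one flow vector per feature point can be sketched with exhaustive block matching. This is a stand-in technique chosen for brevity; the slides do not say which optical flow method was used, and the synthetic texture, grid positions, and window sizes below are assumptions.

```python
def block_matching_flow(prev, curr, points, radius=2, search=2):
    """Estimate one flow vector (dx, dy) per feature point by exhaustive block matching."""
    flows = []
    for px, py in points:
        best, best_cost = (0, 0), float("inf")
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                # Sum of squared differences between the patch and its shifted candidate.
                cost = sum(
                    (prev[py + v][px + u] - curr[py + v + dy][px + u + dx]) ** 2
                    for v in range(-radius, radius + 1)
                    for u in range(-radius, radius + 1)
                )
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
        flows.append(best)
    return flows

def texture(x, y):
    """Synthetic, non-repeating intensity pattern (hypothetical test data)."""
    return (7 * x + 13 * y + x * y) % 17

size = 16
prev_frame = [[texture(x, y) for x in range(size)] for y in range(size)]
curr_frame = [[texture(x - 1, y) for x in range(size)] for y in range(size)]  # shifted right by 1 px

# Equally distributed feature points; in the talk these are tied to the shape model.
grid = [(x, y) for y in (5, 10) for x in (5, 10)]
flow = block_matching_flow(prev_frame, curr_frame, grid)
```

Each grid point yields the vector (1, 0) for this synthetic shift. Concatenating the vectors of all points gives one temporal descriptor per frame pair.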
Audio Low-level Descriptors
Aiming at independence of phonetic content and speaker
Coverage of prosodic, articulatory, and voice quality aspects
20 ms frames, 50% overlap, Hamming window function
  Zero crossing rate (ZCR)
  Pitch
  7 formants
  Energy
  Spectral development
  Harmonics-to-Noise-Ratio (HNR)
  Durations of voiced sounds by HNR
  Durations of silences by bi-state energy
SMA filtering of LLDs
Addition of 1st and 2nd order LLD regression coefficients
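The framing and post-processing chain can be sketched for one descriptor (ZCR). The sketch assumes that SMA means a simple moving average and that the regression coefficients follow the common delta formula d_t = Σ k (c_{t+k} - c_{t-k}) / (2 Σ k²); the window widths, the sample rate, and the test tone are illustrative values, not taken from the slides.

```python
import math

def frames(signal, rate, frame_ms=20, overlap=0.5):
    """Split a signal into Hamming-windowed frames of frame_ms with the given overlap."""
    size = int(rate * frame_ms / 1000)
    hop = int(size * (1 - overlap))
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]
    return [[s * w for s, w in zip(signal[i:i + size], window)]
            for i in range(0, len(signal) - size + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum((a < 0) != (b < 0) for a, b in zip(frame, frame[1:])) / (len(frame) - 1)

def sma(series, width=3):
    """Simple moving average over a centred window (shrunk at the edges)."""
    half = width // 2
    out = []
    for t in range(len(series)):
        chunk = series[max(0, t - half):t + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def delta(series, width=2):
    """First-order regression coefficients; apply twice for second order."""
    norm = 2 * sum(k * k for k in range(1, width + 1))
    last = len(series) - 1
    return [sum(k * (series[min(last, t + k)] - series[max(0, t - k)])
                for k in range(1, width + 1)) / norm
            for t in range(len(series))]

rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / rate) for n in range(rate // 2)]  # 0.5 s, 200 Hz tone
windowed = frames(signal, rate)
zcr = sma([zero_crossing_rate(f) for f in windowed])
zcr_d1 = delta(zcr)       # 1st order regression coefficients
zcr_d2 = delta(zcr_d1)    # 2nd order regression coefficients
```

At 8 kHz a 20 ms frame holds 160 samples and the hop is 80 samples. The other descriptors in the list would each yield a time series of the same length and pass through the same smoothing and regression steps.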
Combining Audio and Video LLDs
Time series constructed for LLDs (audio, video separately)
Application of functionals to combined low-level descriptors
  Linear moments (mean, std. deviation)
  Quartiles
  Durations
Resulting feature vector:
  276 audio features
  1048 video features
SVM
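Low-level (early) fusion with functionals can be sketched as follows. The descriptor names and values are made up, only mean, standard deviation, and quartiles are computed (duration functionals are omitted), and the SVM stage is not shown.

```python
import statistics

def functionals(series):
    """Map one LLD time series of any length to a fixed set of statistics."""
    q1, median, q3 = statistics.quantiles(series, n=4)
    return [statistics.mean(series), statistics.pstdev(series), q1, median, q3]

def fuse(audio_llds, video_llds):
    """Early fusion: functionals of every audio and video LLD in one feature vector."""
    vector = []
    for name in sorted(audio_llds):
        vector.extend(functionals(audio_llds[name]))
    for name in sorted(video_llds):
        vector.extend(functionals(video_llds[name]))
    return vector

# Hypothetical LLD time series; audio and video may have different lengths and rates.
audio = {"energy": [0.1, 0.4, 0.9, 0.5, 0.2], "zcr": [0.05, 0.06, 0.04, 0.05, 0.05]}
video = {"flow_dx": [0.0, 1.0, 2.0, 1.0], "mouth_opening": [0.2, 0.3, 0.8, 0.7]}
feature_vector = fuse(audio, video)
```

Because each functional reduces a series to a single number, the audio and video streams do not need the same frame rate before they are joined. The fused vector is what a single classifier such as the SVM would receive.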
Experimental Results (1)
Database: Airplane Behavior Corpus
  Guided storyline
  8 subjects (25 to 48 years old)
  11.5 hours of video in total
10-fold stratified cross validation
Feature pre-selection by SVM-SFFS (sequential forward floating search)
               Audio  Video  Audiovisual
Features [#]   92     156    200
Accuracy [%]   73.7   61.1   81.8
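The evaluation protocol (10-fold stratified cross validation) can be sketched without any library. The label counts are hypothetical and no shuffling is done, so this only illustrates how stratification keeps class proportions equal across folds.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign sample indices to k folds so each fold mirrors the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        # Deal the samples of each class round-robin across the folds.
        for index in by_class[label]:
            folds[position % k].append(index)
            position += 1
    return folds

def cross_validation_splits(labels, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical label counts, for illustration only.
labels = ["aggressive"] * 30 + ["neutral"] * 50 + ["nervous"] * 20
splits = list(cross_validation_splits(labels))
```

Each test fold here holds 3 aggressive, 5 neutral, and 2 nervous samples. The reported accuracy would be the average over the ten test folds.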
Experimental Results (2)
Main confusions:
  neutral, nervous
  cheerful, intoxicated
Aggressive behavior recognized best
Conclusion and Outlook
Combined feature set superior to individual audio or video feature set
Future work:
  Investigation on further data sets
  Comparison to late fusion approaches
  Performance of asynchronous feature fusion
  Application of hierarchical functionals
Thank you!