~ Multimodal Video Classification ~

Bogdan IONESCU*1,3, Ionuț MIRONICĂ1, Klaus SEYERLEHNER2, Peter KNEES2, Jan SCHLÜTER4, Markus SCHEDL2, Horia CUCU1, Andi BUZO1, Patrick LAMBERT3

ARF (Austria-Romania-France) team
1 University POLITEHNICA of Bucharest
4 Austrian Research Institute for Artificial Intelligence

*this work was partially supported under European Structural Funds EXCEL POSDRU/89/1.5/S/62557.
Presentation outline
MediaEval - Pisa, Italy, 4-5 October 2012 1/16
• The approach
• Video content description
• Experimental results
• Conclusions and future work
The approach
> challenge: find a way to assign (genre) tags to unknown videos;
> approach: a machine learning paradigm;

[diagram: a labeled (tagged) video database is used to train a classifier, which then assigns genre labels (e.g. web, food, autos) to unlabeled videos]
The approach: classification
> the entire process relies on the concept of “similarity” computed between content annotations (numeric features);
> this year's focus is on:
objective 1: go (truly) multimodal: visual, audio, text;
objective 2: test a broad range of classifiers and descriptor combinations.
Video content description - audio
[Klaus Seyerlehner et al., MIREX’11, USA]
block-level audio features (these also capture local temporal information)

[diagram: the spectrogram is cut into overlapping blocks (e.g. 50% overlap); blocks are summarized over time by statistics such as the average, median, and variance]
• Spectral Pattern, ~ soundtrack’s timbre;
• Delta Spectral Pattern, ~ strength of onsets;
• Variance Delta Spectral Pattern, ~ variation of the onset strength;
• Logarithmic Fluctuation Pattern, ~ rhythmic aspects;
• Spectral Contrast Pattern, ~ “tone-ness”;
• Correlation Pattern, ~ loudness changes;
• Local Single Gaussian model, ~ timbre;
• George Tzanetakis model, ~ timbre.
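The block-processing idea behind these features can be sketched as follows. This is a simplified illustration only: the block length, hop size, and the sort-then-percentile aggregation are toy choices standing in for the actual block-level feature definitions of Seyerlehner et al.

```python
import numpy as np

def block_level_summary(spectrogram, block_len=64, hop=32, percentile=50):
    """Sketch of block-level processing: split a spectrogram
    (freq_bins x frames) into 50%-overlapping blocks of frames, sort each
    block along time, and summarize across blocks with a percentile."""
    n_bins, n_frames = spectrogram.shape
    blocks = []
    for start in range(0, n_frames - block_len + 1, hop):
        block = spectrogram[:, start:start + block_len]
        # sorting within the block keeps local temporal structure
        # while discarding the absolute position of events
        blocks.append(np.sort(block, axis=1))
    blocks = np.stack(blocks)  # (n_blocks, n_bins, block_len)
    # aggregate over all blocks -> one pattern per track
    return np.percentile(blocks, percentile, axis=0)

# toy usage: a fake 40-bin, 500-frame magnitude spectrogram
spec = np.abs(np.random.randn(40, 500))
pattern = block_level_summary(spec)
print(pattern.shape)  # (40, 64)
```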
Video content description - audio
[B. Mathieu et al., Yaafe toolbox, ISMIR’10, Netherlands]
standard audio features (audio frame-based)
• Linear Predictive Coefficients,
• Line Spectral Pairs,
• Mel-Frequency Cepstral Coefficients,
• Zero-Crossing Rate,
• spectral centroid, flux, rolloff, and kurtosis,
+ variance of each feature over a certain window.

[diagram: frame-based features f1 … fn over time; the global feature is the mean & variance of each feature across frames]
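The frame-to-clip aggregation on this slide can be sketched as below. This is a minimal sketch of the aggregation step only; the per-frame features themselves come from the Yaafe toolbox.

```python
import numpy as np

def global_audio_feature(frame_features):
    """Turn per-frame feature vectors (n_frames x n_dims), e.g. MFCCs,
    into one fixed-length clip descriptor by concatenating the
    per-dimension mean and variance, as on the slide."""
    frame_features = np.asarray(frame_features, dtype=float)
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.var(axis=0)])

# toy usage: 200 frames of 13-dim MFCCs -> 26-dim global feature
mfccs = np.random.randn(200, 13)
clip_feature = global_audio_feature(mfccs)
print(clip_feature.shape)  # (26,)
```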
Video content description - visual
[OpenCV toolbox, http://opencv.willowgarage.com]
MPEG-7 & color/texture descriptors (visual frame-based)
• Local Binary Pattern,
• Autocorrelogram,
• Color Coherence Vector,
• Color Layout Pattern,
• Edge Histogram,
• Scalable Color Descriptor,
• Classic color histogram,
• Color moments.
[diagram: frame-based features f1 … fn over time; the global feature concatenates the mean, dispersion, skewness, kurtosis, median, and root mean square across frames]
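The six-statistic aggregation of per-frame visual descriptors can be sketched as below; a sketch of the aggregation step only, with a toy descriptor dimension.

```python
import numpy as np

def global_visual_feature(frame_descriptors):
    """Aggregate per-frame visual descriptors (n_frames x n_dims) into one
    clip-level vector using the six statistics on the slide: mean,
    dispersion (std), skewness, kurtosis, median, root mean square."""
    X = np.asarray(frame_descriptors, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    z = (X - mu) / (sd + 1e-12)            # standardized frames
    return np.concatenate([
        mu,
        sd,                                 # dispersion
        (z ** 3).mean(axis=0),              # skewness
        (z ** 4).mean(axis=0) - 3.0,        # (excess) kurtosis
        np.median(X, axis=0),
        np.sqrt((X ** 2).mean(axis=0)),     # root mean square
    ])

# toy usage: 120 frames of a 64-bin color histogram -> 6*64 = 384 dims
hists = np.random.rand(120, 64)
visual_feature = global_visual_feature(hists)
print(visual_feature.shape)  # (384,)
```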
Video content description - visual
feature descriptors (visual frame-based)
• Histogram of Oriented Gradients (HoG), ~ counts occurrences of gradient orientations in localized portions of an image (20º per bin);
• Harris corner detector;
• Speeded Up Robust Features (SURF).

[image: feature points (e.g. Harris); source http://www.ifp.illinois.edu/~yuhuang]
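The HoG idea can be sketched for a single cell with the 20º bins mentioned above. This is an illustration of the histogram step only; the real descriptor (and the OpenCV implementation used) adds cells, block normalization, and more.

```python
import numpy as np

def hog_cell_histogram(gray, bin_width_deg=20):
    """Histogram of gradient orientations over one grayscale image cell,
    with 20-degree bins (9 bins over the unsigned range [0, 180))."""
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)                    # image gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned
    n_bins = int(180 // bin_width_deg)            # 9 bins for 20 degrees
    hist, _ = np.histogram(orientation, bins=n_bins, range=(0, 180),
                           weights=magnitude)    # magnitude-weighted votes
    return hist / (hist.sum() + 1e-12)           # normalize

# toy usage on a random 32x32 patch
img = np.random.rand(32, 32)
h = hog_cell_histogram(img)
print(h.shape)  # (9,)
```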
Video content description - text
TF-IDF descriptors (Term Frequency-Inverse Document Frequency)
> text sources: ASR and metadata,
1. remove XML markup,
2. remove terms below the 5th percentile of the frequency distribution,
3. select the term corpus: retain for each genre class the m terms (e.g. m = 150 for ASR and 20 for metadata) with the highest χ2 values that occur more frequently than in the complement classes,
4. represent each document by its TF-IDF values.
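Step 4 can be sketched with a minimal TF-IDF weighting (tf = raw count, idf = log(N/df)); the exact weighting variant is not specified on the slide, the χ2 term selection of step 3 is omitted, and the documents here are toy stand-ins for ASR transcripts.

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF sketch: one weight per (document, vocabulary term)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

# toy usage: three tiny "transcripts"
docs = ["cooking pasta recipe", "car engine review", "pasta sauce tips"]
vocab, vecs = tfidf(docs)
# "pasta" appears in 2 of 3 docs -> idf = log(3/2); unique terms get log(3)
```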
Experimental results: devset (5,127 seq.)
[Weka toolbox, http://www.cs.waikato.ac.nz/ml/weka/]
> classifiers from Weka (Bayes, lazy, functional, trees, etc.),
> cross-validation (train 50% – test 50%),
[chart: avg. F-score (over all genres) for the visual descriptors]
- visual descriptors achieve around 30%±10%,
- using more visual descriptors is not more accurate than using a few,
- best: LBP+CCV+histogram (F-score = 41.2%).
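The evaluation protocol (50/50 train–test split, F-score averaged over all genres) can be sketched as follows. A nearest-centroid model stands in for the Weka classifiers actually used, and the data is a toy two-genre example.

```python
import numpy as np

def macro_fscore_5050(X, y, seed=0):
    """Split the data 50/50 into train/test, fit a nearest-centroid
    classifier, and report the F-score averaged over all classes."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(y))
    tr, te = idx[:len(y) // 2], idx[len(y) // 2:]
    classes = sorted(set(y.tolist()))
    centroids = {c: X[tr][y[tr] == c].mean(axis=0) for c in classes}
    pred = np.array([min(classes, key=lambda c: np.linalg.norm(x - centroids[c]))
                     for x in X[te]])
    fs = []
    for c in classes:  # per-genre F-score, then macro average
        tp = np.sum((pred == c) & (y[te] == c))
        fp = np.sum((pred == c) & (y[te] != c))
        fn = np.sum((pred != c) & (y[te] == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
    return float(np.mean(fs))

# toy usage: two well-separated "genres"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = ["food"] * 20 + ["autos"] * 20
score = macro_fscore_5050(X, y)
print(score)
```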
Experimental results: devset (5,127 seq.), audio descriptors
[chart: avg. F-score (over all genres)]
> cross-validation (train 50% – test 50%),
- the proposed block-based features outperform the standard ones (by ~10%),
- audio is still better than visual (an improvement of ~6%),
Experimental results: devset (5,127 seq.), text descriptors
[chart: avg. F-score (over all genres)]
> cross-validation (train 50% – test 50%),
- ASR from LIMSI is more representative than LIUM (by ~3%),
- best performance: ASR LIMSI + metadata (F-score = 68%).
Experimental results: devset (5,127 seq.), descriptor fusion
[chart: avg. F-score (over all genres)]
> cross-validation (train 50% – test 50%),
- audio-visual comes close to text (ASR) for the automatic descriptors,
- increasing the number of modalities increases the performance.
Experimental results: official runs (9,550 seq.)
> train on devset, test on testset (linear SVM),
Run 1: LBP + CCV + hist + audio block-based;
Run 2: TF-IDF on ASR LIMSI;
Run 3: audio block-based + LBP + CCV + hist + TF-IDF on ASR LIMSI;
Run 4: audio block-based;
Run 5: TF-IDF on metadata + ASR LIMSI.
[chart: MAP per run, with MediaEval 2011 reference results: MAP 10.3% and MAP 12% (metadata)]
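How a combined run such as Run 3 can be assembled is sketched below: per-modality descriptors of the same video are concatenated into one vector before training the linear SVM. The per-modality L2 normalization and the dimensions are assumptions for illustration, not details from the slides.

```python
import numpy as np

def early_fusion(*descriptors):
    """Concatenate several per-modality descriptors into one vector,
    L2-normalizing each modality so that none dominates by scale alone."""
    parts = []
    for d in descriptors:
        d = np.asarray(d, dtype=float)
        parts.append(d / (np.linalg.norm(d) + 1e-12))
    return np.concatenate(parts)

# toy dimensions for the three modalities of Run 3
audio_blf = np.random.rand(100)   # audio block-based
visual = np.random.rand(64)       # LBP + CCV + hist
text = np.random.rand(150)        # TF-IDF on ASR
fused = early_fusion(audio_blf, visual, text)
print(fused.shape)  # (314,)
```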
Experimental results: official runs (9,550 seq.), per-genre results
[charts: genre MAP for Run 5 (TF-IDF on ASR + metadata) and Run 1 (visual + audio); top genres include religion 71%, gaming 71%, autos 52%, environment 50%]
Conclusions and future work
> classification adapts to the corpus – changing the corpus will change the performance;
> audio-visual descriptors are inherently limited;
> how far can we go with ad-hoc classification without human intervention?
> future work:
- more elaborate late fusion?
- pursue tests on the entire data set;
- perhaps a more elaborate Bag-of-Visual-Words.

Acknowledgement: we would like to thank Prof. Fausto Giunchiglia and Prof. Nicu Sebe from the University of Trento for their support.
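One of the future-work directions, a more elaborate late fusion, could start from a sketch like the following: each modality's classifier outputs per-genre scores, and the fused decision is a weighted average. The genre names, scores, and uniform weights are toy values for illustration.

```python
import numpy as np

def late_fusion(score_lists, weights=None):
    """Weighted late fusion: average per-genre scores across the
    per-modality classifiers, then pick the best genre."""
    scores = np.asarray(score_lists, dtype=float)  # (n_modalities, n_genres)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)  # uniform by default
    fused = np.average(scores, axis=0, weights=weights)
    return int(np.argmax(fused)), fused

genres = ["web", "food", "autos"]
audio_scores = [0.2, 0.5, 0.3]
visual_scores = [0.1, 0.3, 0.6]
text_scores = [0.1, 0.7, 0.2]
best, fused = late_fusion([audio_scores, visual_scores, text_scores])
print(genres[best])  # food
```

Learning non-uniform weights per modality (e.g. from devset performance) would be one way to make this "more elaborate".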
thank you !
any questions ?