IBM TRECVID’08 High-level Feature Detection

1

© 2007 IBM Corporation

IBM TRECVID’08 High-level Feature Detection

Rong YanIBM T. J. Watson Research CenterHawthorne, NY 10532 USA

Team: Apostol Natsev, Wei Jiang, Michele Merler, John R. Smith, Jelena Tesic, Lexing Xie, Rong Yan

© 2007 IBM Corporation2

Main Research Highlights

� Distributed end-to-end machine learning system

� Efficient model learning w. random subspace bagging

� Semi-supervised joint feature and concept modeling

� Cross-domain learning from web domain

2


Distributed End-to-End Concept Modeling Tool

� Scalable, configurable and extensible based on UIMA and ActiveMQ

� Improve concept model throughput by 10x over last year

WebSphere Cluster

Server Host -1

QM-1

Q-1

WAS-1

ServiceDispatcher

UIMA-FC

Server Host -2

QM-2

Q-2

WAS-2

ServiceDispatcher

UIMA-FC

MQCluster

Node-1

Flow Controller

UIMA-FC

Aggregate-1Flow

Controller

UIMA-AE

Delegate-1a

Flow Controller

UIMA-AE

Delegate-1b

Node-2

Flow Controller

UIMA-FC

Aggregate-2Flow

Controller

UIMA-AE

Delegate-2a

Flow Controller

UIMA-AE

Delegate-2b

1

1

n

n


System Overview

Video Collection

Video Collection

Annotation(multiple versions)


Glo

bal

Featu

re

Glo

bal

Featu

re

Color Corr.Color Corr.

Color Hist.Color Hist.

TextureTexture

Gri

dF

eatu

re

Gri

dF

eatu

re

Color Mom.Color Mom.

WaveletWavelet

TextText

AS

RA

SR

RS-BagRS-Bag

PCS3VMPCS3VM

Cross Domain (Web)

Cross Domain

(Web)

Text SearchText Search

AverageAverage

Learn-based

Weighting

Learn-based

Weighting

Cross-Concept

Cross-Concept

Feature Extraction Model Learning Fusion

3


System Overview

Video Collection

Video Collection



Glo

bal

Featu

re

Glo

bal

Featu

re



TextureTextureG

rid

Featu

re

Gri

dF

eatu

re


WaveletWavelet

TextText

AS

RA

SR

RS-BagRS-Bag

PCS3VMPCS3VM

Cross Domain

(Web)

Cross Domain

(Web)


AverageAverage

Learn-based

Weighting

Learn-based

Weighting

Cross-Concept

Cross-Concept

Feature Extraction Model Learning Fusion

Glo

bal

Featu

re

Glo

bal

Featu

re



TextureTextureG

rid

Featu

re

Gri

dF

eatu

re


WaveletWavelet

TextText

AS

RA

SR

RS-BagRS-Bag

PCS3VMPCS3VM

Cross Domain

(Web)

Cross Domain

(Web)


AverageAverage

Learn-based

Weighting

Learn-based

Weighting

Cross-Concept

Cross-Concept

5

6

4

32

1


Outline

Baseline: Random subspace bagging

Principled Component Semi-Supervised SVM

Learning from Web Domain

Conclusion

4


Highlights on Feature Extraction

� Low-level features: significantly increase from 7 types to 98 types

– Following the successful stories of previous TRECVID experiments

– 13 different visual descriptors (e.g., color histogram, edge histogram … )

– 8 granularities (i.e., global, center, cross, grid, horizontal parts, horizontal center, vertical parts and vertical center)

– Future work: incorporating local features (interest point / dense sampled)

� Text search for concept detection

– Text features officially provided by NIST

– Juru search engine with manually expanded keywords on development set


Baseline: Random-Subspace Bagging

� Random-Subspace Bagging

– For each concept and each low-level feature, select multiple bags of examples training set, sampled from training examples and feature space

– Learn a SVM model on each bag

– Evaluate the cross-validation performance

– Select and fuse these models into an ensemble model based on held-out data

– Similar to Random Forest w. decision trees

� Advantages:

– Improve computational efficiency

– Detection errors are theoretically bounded

– Easy to parallelize using MapReduce w/o major changes for learning packages

Features

SVM1SVM1

Concept ModelsConcept Models

SVM2SVM2

Tra

inin

g V

ideo

5


Algorithm Details & Highlights

Learn SVM base model w. sampled data & features

Evaluate cross-validation performance for the model

Mapping Phase

Select nth base model w. greedy/combined strategy

Combine selected model

Reducing Phase


Recap: Random Subspace Bagging in TRECVID’07

� RSBag provides a more than 10-fold speedup on learning process and

even a slightly better performance than the baseline SVM-07 approach

0.0870

0.0844

0.0667

0.0638

MAP

--SVM-07 + SVM-Min07 + Text + HoG

--RSBag + SVM-Min07 + Text + HoG

2342-RSBag

24602-SVM-07

TimeRunDescription

(RSBag: random subspace bagging w. 3 models, data sample ratio 0.2 and feature sample ratio 0.5)

6


TREC’08 Internal Validation Results for RSBag

� Setting: 70% of devel. data as training set, 30% as validation set

1. Consistently better validation performance with larger bag

size, but saturate around 1000 training data per bag

2. The combined strategy consistently outperforms the greedy strategy

3. Sampling the feature space with a ratio of 50% gives slightly worse result, but much less training time


Official Evaluation Results: Baseline (RSBag), Text

� Observations

– Fusion on multiple learning configurations / features almost always help, even when performance of individual component is not as good as baseline

– Text features improve the performance on “Airplane”, “Boat_Ship”, but hurt the performance on “Nighttime”, “Street”

0.1010A-RSBag w. Combined

0.0231A-Text

0.1240A-Baseline + Text

0.1052ABaseline_5Baseline (RSBag fused)

0.0975A-RSBag w. Greedy

MAPTypeRunDescription

7


Outline

Baseline: Random subspace bagging

Principled Component Semi-Supervised SVM


Conclusion


Principled Component Semi-Supervised SVM (PCS3VM)

Small sample learning difficulty: limited labeled data, high dimensionality

Existing solutions:

1. Semi-supervised learning: consider both XL and X

U

Pursue discrimination over XL

with constraints from both XL

and XU

,

under some assumption about data distribution.

2. Dimension reduction: Reduce dimensionality of X

Learn a feature subspace that preserves certain properties of the

data distribution, e.g., PCA preserves data diversity and LDA

preserves discriminant property.

8


PCS3VM: Joint Feature Subspace and SVM Learning

2

2 1, ,

1min min || || ,

2

Ln

d iiQ C ε

=

= +

∑

w b w b

w . . ( ( ) ) 1 , 0, xT

i i i i Ls t y b Xφ ε ε+ ≥ − ≥ ∀ ∈w x

Supervised SVM: learn classifier (w, b) pursing discrimination over XL

{ }max max ,T T

fQ tr XX=

a a

a a . . Ts t I=a a

PCA: learn projection a, preserving the variance over entire data X

Our Algorithm: jointly learn projection a and classifier (w, b)

, ,min( ),d fQ Q−w b a

. . , ( ( ) ) 1 , 0, xT T T

i i i i Ls t I y b Xφ ε ε= + ≥ − ≥ ∀ ∈a a w a x

Learn feature subspace that is discriminative over labeled data XL,

and simultaneously learn large margin SVM in the feature subspace

Proposed Solution:


Evaluation Results: PCS3VM

� Observations

– Although PCS3VM does not outperform baseline on average, it achieves better results on several concepts such as “nighttime” and “driver”.

– Combined with baseline, it obtains 12% performance gain in terms of MAP.

0.1052ABaseline_5Baseline

0.1182ABaseSSL_4Baseline + PCS3VM

0.1268ABaseSSLText_3Baseline + PCS3VM + Text

0.0750A-PCS3VM


9



� Online channels have provided a rich

source of multimedia training data

– How useful for learning TREC concepts?

� Approaches for web domain learning

– Manually create 2-5 queries per concept

– Download top 100-200 images from two different websites: Google & Flickr

– Manually annotate all the images, end up with 30,000 images for 20 concepts

– Use the baseline method to learn models


Detailed Evaluation Results on Learning Web Domain

� The success of learning from web domain depends on the generality of

the target domain, e.g. “emergency vehicle” (+) and “dog” (-)

� Improve 6 out of 20 concepts, but MAP (0.052) is lower than baseline

– The gap become much smaller (0.052 vs. 0.070) after removing 4 concepts

00.2508Dog

0.14110.1568Boat Ship

0.13390.0919Airplane

0.0620 (rank

4th in 200 runs)

0.0076Emergency

Vehicle

Web BaselineConceptTRECVID Web

10


Cross-concept Detection to Combine Web Concepts

� Concept relational models to improve detection performance

– Naïve Bayes algorithm by estimating pairwise conditional probabilities viamaximum likelihood

– The conditional probabilities are smoothed by prior probabilities

– Use 20 Type-A concepts and 200+ concepts learned from web domain

Sky: ?Sky: ?Grass: ?Grass: ? Outdoor: ?Outdoor: ?

Sky: 0.9Sky: 0.9Grass: 0.2Grass: 0.2 Outdoor: 0.8Outdoor: 0.8


Evaluation Results: Web Domain & Cross-Concept

� Observations

– Fusing web training data with baseline using multi-concept learning approaches, we can improve MAP by another 2%

– Improve a number of concepts, e.g., “Airplane flying” and “Kitchen”



0.1200CBNet_2Cross-Concept: Baseline, Web, PCS3VM

0.0519CCrossDomain_6Web Domain Learning


11


Summarization of Submitted Runs

0.1200CBNet_2Cross-Concept: Baseline, Web, PCS3VM

0.1340CBOR_1Best Overall Run

0.1268ABaseSSLText_3Baseline + PCS3VM + Text

0.0519CCrossDomain_6Web Domain




� Submitted: 5 runs + 1 Best Overall Run (~0.134 MAP)

� Almost all components help, even individual MAP is worse than baseline

� Best run improve over visual baseline by 30%


Overall Performance

12


Concluding Remarks

� Large-scale distribute machine learning system, improve

computational throughput by 10x over last year

� Random subspace bagging considerably improve computational

efficiency w/o hurting performance, also easy to parallelize

� PCS3VM jointly learns feature space and large-margin SVMs,

provide better performance after combined with baseline

� Web domain training data can be leveraged for generic concepts or

infrequent concepts on target domain

� Future directions

– More scalable distributed algorithm, local features, non-visual meta-data…

Documents

IBM TRECVID’08 High-level Feature Detection