33
Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1 , Yu-Gang Jiang 1,2 , Junfeng He 1 , Elie El Khoury 3 , Chong-Wah Ngo 2 1 DVMM Lab, Columbia University 2 City University of Hong Kong 3 IRIT, Toulouse, France TRECVID 2008 workshop, NIST

Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Embed Size (px)

Citation preview

Page 1: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Concept Detection: Convergence to Local Features

and Opportunities Beyond

Shih-Fu Chang1, Yu-Gang Jiang1,2, Junfeng He1, Elie El Khoury3,Chong-Wah Ngo2

1 DVMM Lab, Columbia University

2 City University of Hong Kong3 IRIT, Toulouse, France

TRECVID 2008 workshop, NIST

Page 2: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Overview: 5 components & 6 runs

5 components

2

Local FeatureLocal Feature

Global FeatureGlobal Feature

CU-VIREO374CU-VIREO374

Web ImagesWeb Images

374-d fea.

Face & AudioFace & Audio

SVMSVM

Classifiers

66

55

44

33

FilteringFiltering

1144

22

Page 3: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Overview: overall performance

– Local feature alone already achieves near top performance– Every other component contributes incrementally to the

final detection

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

Mea

n Av

erag

e Pr

ecis

ion

TRECVID 2008 Type-A Submissions (161)

3

Page 4: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Overview: per-concept performance

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

CU_2_run4+face&audio CU_4_run5+cu-vireo374

CU_5_local_global CU_6_local_only

MEAN MAX

4

Page 5: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Outline

5

5 components 5 components

Local FeatureLocal Feature

Global FeatureGlobal Feature

CU-VIREO374CU-VIREO374

Web ImagesWeb Images

374-d fea.374-d fea.

Face & AudioFace & Audio

SVMSVM

ClassifiersClassifiers

66

55

44

33

FilteringFiltering

1144

22

Page 6: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Bag-of-Visual-Words (BoW)

Keypoint extraction Visual word vocabulary

BoW histogram

Keypoint feature space

...

Clustering

............

v2 v3v1 v4

...

6

Page 7: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Representation Choices of BoW

• Word weighting scheme– How to weight the importance of a word to an

image?

• Spatial information– Are the spatial locations of keypoints useful?

7

Page 8: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Weighting Scheme

• Traditional…– Binary, Term frequency (TF), inverse document frequency

(IDF)…

• Our method – soft weighting

Image from http://www.cs.joensuu.fi/pages/franti/vq/lkm15.gif

-- Assign a keypoint to multiple visual words

-- weights are determined by keypoint-to- word similarity

Details in:Jiang et al. CIVR 2007.

8

Page 9: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Vocabulary Size & Weighting Scheme

0

0.02

0.04

0.06

0.08

0.1

0.12

500 1,000 5,000 10,000

Mea

n A

vera

ge P

reci

sion

Vocabulary Size

TRECVID 2006 Test DataBinary TF TF-IDF Soft

– Soft weighting • Improve TF by 10%-20%

– More accurate to assess the importance of a keypoint

9

Page 10: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Spatial Information

• Partition image into equal-sized regions• Concatenate BoW features from the regions– Poor generalizability

F = (f11, f12, f13, f21, f22, f23, f31, f32, f33)10

Page 11: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Spatial Information

0.00

0.02

0.04

0.06

0.08

0.10

0.12

0.14

500 1000 5000 10000

Mea

n av

erag

e pr

ecisi

on

Vocabulary size

TRECVID 2006 Test Data (soft-weighting)

1×1 region 2×2 regions 3×3 regions 4×4 regions

– Spatial Information does not help much for concept detection• 2x2 is a good choice• 3x3 and 4x4 may cause mismatch problem

11

Page 12: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Local Feature Representation Framework

• a Keypoint extraction

Visual word vocabulary 1

SIF

T fe

atur

e sp

ace

......... .........

Visual word vocabulary 2

.........

Visual word vocabulary 3

DoG Hessian Affine MSER

BoW histograms Using Soft-Weighting

... + + +

+ + ++ + ++ + +

+

- - -

- - -

- - -

- - - - - -

- - - - - - - - -

SVM classifiers

... ......

+ + +

+ + ++ + ++ + +

+

- - -

- - -

- - -

- - - - - -

- - - - - - - - -

+ + +

+ + ++ + +

- - -

- - -

- - -

- - - - - -

- - - - - - - - -

Vocabulary Generation BoW Representation

K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool, “A comparison of affine region detectors”, IJCV, vol. 65, pp. 43-72, 2005.K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool, “A comparison of affine region detectors”, IJCV, vol. 65, pp. 43-72, 2005.

12

Page 13: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Internal Results – Local Features

• Over TRECVID 2008 Test Data

Similar! 13%13%

13

MAP: 0.157MAP: 0.157

Page 14: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Failure Cases - I

14

• Flower– Small visual area

– Coloration/texture too similar to background scene

• Possible Solutions– Color-descriptor

– Class-specific visual words

missesmisses

Page 15: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Failure Cases - II

15

• Boat_Ship, Airplane_flying– Learning biased by background

scene

– Difficulty from occlusion

• Possible Solution– Feature selection

missesmisses

Page 16: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Summary – Local Features

16

• BoW with good representation choices achieved very impressive performance• Soft-weighting is very effective• Multiple spatial layouts are useful• Multi-detectors do not help much

• Rooms for future improvement• Class-specific visual words, feature selection,

color-descriptor etc.

Page 17: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Outline

17

5 components 5 components

Local FeatureLocal Feature

Global FeatureGlobal Feature

CU-VIREO374CU-VIREO374

Web ImagesWeb Images

374-d fea.374-d fea.

Face & AudioFace & Audio

SVMSVM

ClassifiersClassifiers

66

55

44

33

FilteringFiltering

1144

22

Page 18: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Global Features

18

• Grid-based Color Moments (225-d)• Wavelet texture (81-d)

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4A_ CU_6_local_only_6 A_ CU_5_local_global_5

Page 19: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Outline

19

5 components 5 components

Local FeatureLocal Feature

Global FeatureGlobal Feature

CU-VIREO374CU-VIREO374

Web ImagesWeb Images

374-d fea.374-d fea.

Face & AudioFace & Audio

SVMSVM

ClassifiersClassifiers

66

55

44

33

FilteringFiltering

1144

22

Page 20: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

CU-VIREO374• Fusion of Columbia374 and VIREO374

Feature Dimension

Columbia374

Grid-based color moment (LUV) 225

Gabor Texture 48

Edge Direction Histogram 73

VIREO374

Bag-of-visual-words (soft weighting) 500

Grid-based Color Moment (Lab) 225

Grid-based Wavelet Texture 81

Performance of CU-VIREO374 over TRECVID 2006 Test Data

CU-VIREO374

VIREO374

Columbia374

Scores on the TRECVID2008 corpora:http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/

Scores on the TRECVID2008 corpora:http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/

20Yu-Gang Jiang, Akira Yanagawa, Shih-Fu Chang, Chong-Wah Ngo, "Fusing Columbia374 and VIREO-374 for Large Scale Semantic Concept Detection", Columbia University ADVENT Technical Report #223-2008-1, Aug. 2008.

Yu-Gang Jiang, Akira Yanagawa, Shih-Fu Chang, Chong-Wah Ngo, "Fusing Columbia374 and VIREO-374 for Large Scale Semantic Concept Detection", Columbia University ADVENT Technical Report #223-2008-1, Aug. 2008.

Page 21: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Concept Fusion Using CU-VIREO374

• Train a SVM for each concept– Using CU-VIREO374 scores as features

– Performance improvement is merely 2%• Need a better concept fusion model!

2.2%2.2%

21

Run5: Local+global

Page 22: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Outline

22

5 components 5 components

Local FeatureLocal Feature

Global FeatureGlobal Feature

CU-VIREO374CU-VIREO374

Web ImagesWeb Images

374-d fea.374-d fea.

Face & AudioFace & Audio

SVMSVM

ClassifiersClassifiers

66

55

44

33

FilteringFiltering

1144

22

Page 23: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Exploring External Images from Web

• Problem– Sparsity of positive data

Concept Name # Positive shots Concept Name # Positive shots

Classroom 224 Harbor 195

Bridge 158 Telephone 184

Emergency_Vehicle 88 Street 1551

Dog 122 Demonstration/Protest 134

Kitchen 250 Hand 1515

Airplane_flying 72 Mountain 239

Two_people 3630 Nighttime 424

Bus 87 Boat_Ship 437

Driver 258 Flower 582

Cityscape 288 Singing 366

Total # of shots in TV’08 Dev: 36,262

23

Page 24: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Challenging Issues

• How to make use of the large amount of “noisily labeled” web images for concept detection?– Issue 1: filter the false positive samples

Good

Bad

24

Flickr Images

Page 25: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Challenging Issues

• How to make use of the large amount of “noisily labeled” web images for concept detection?– Issue 1: filter the false positive samples– Issue 2: overcome the cross-domain problem

Flickr

TRECVID

25

Page 26: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Preliminary Results

• Web image set: 18,000 from Flickr– Issue 1: filter the false positive samples

• Graph based semi-supervised learning

– Issue 2: overcome the cross-domain problem• Weighted SVM

• Results– MAP: no difference– “Bus”: improve 50%

• Open Problem!0

0.05

0.1

0.15

0.2

0.25

0.3A_CU_Run-4

C_CU_Run-3 (Bug Free)

26

Page 27: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Outline

27

5 components 5 components

Local FeatureLocal Feature

Global FeatureGlobal Feature

CU-VIREO374CU-VIREO374

Web ImagesWeb Images

374-d fea.374-d fea.

Face & AudioFace & Audio

SVMSVM

ClassifiersClassifiers

66

55

44

33

FilteringFiltering

1144

22

Page 28: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Face Detection and Tracking

• Face Detection (OpenCV Toolbox)• Tracking based on face location and skin color

Character 1

Start Frame End Frame

Backward tracking

Forward tracking

Pt1Pt2

Pt1Pt2

...

Pt1

Pt2

x

x

Face Detection

Tracking

28

Page 29: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

“Two_people” Detector

• Drawback– Cannot find person when face is too small or

invisible

•2 people•250 frames/person1 & 150 frames/person2

•100 frames/1 person & 150 frames/2 people

•250 frames

29

Page 30: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Detecting “Singing” based on Audio

• Vibrato – “the variation of the frequency of an musical instrument

or of the voice”

• Harmonic Coefficient Ha– It corresponds to the most important trigonometric series

of the spectrum– Ha is higher in the presence of singing voice

30

Page 31: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Performance – Face & Audio

• Improve “two_people” by 4% and “singing” by 8%– Simple heuristics help detect specific concepts.

31

Page 32: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

Conclusions• Convergence to Local Features

– Local feature alone achieved an impressive MAP of 0.157• Representation choices are critical for good performance

– The combination of local features and global features introduces a moderate gain (MAP 0.162)

• CU-VIREO374– Useful resource for concept fusion and video search– A better fusion model is needed

• Face & Audio detectors– Simple heuristics help detect specific concepts

• Training from external web images – open problem– Useful for concepts lacking in positive training samples– Challenges:

• Unreliable labels • Domain differences

32

Page 33: Concept Detection: Convergence to Local Features and Opportunities Beyond Shih-Fu Chang 1, Yu-Gang Jiang 1,2, Junfeng He 1, Elie El Khoury 3, Chong-Wah

More information at:http://www.ee.columbia.edu/dvmm/

33