
ICVSS2011 Selected Presentations


DESCRIPTION

This presentation shares experiences and selected talks from the International Computer Vision Summer School (ICVSS 2011), attended by Angel Cruz and Andrea Rueda from the BioIngenium Research Group, Universidad Nacional de Colombia.


Page 1: ICVSS2011 Selected Presentations


ICVSS 2011: Selected Presentations

Angel Cruz and Andrea Rueda

BioIngenium Research Group, Universidad Nacional de Colombia

August 25, 2011


Page 2: ICVSS2011 Selected Presentations


Outline

1 ICVSS 2011

2 A Trillion Photos - Steven Seitz

3 Efficient Novel Class Recognition and Search - Lorenzo Torresani

4 The Life of Structured Learned Dictionaries - Guillermo Sapiro

5 Image Rearrangement & Video Synopsis - Shmuel Peleg



Page 4: ICVSS2011 Selected Presentations


ICVSS 2011: International Computer Vision Summer School

15 speakers, from the USA, France, the UK, Italy, the Czech Republic (Prague), and Israel


Page 5: ICVSS2011 Selected Presentations


ICVSS 2011: International Computer Vision Summer School


Page 6: ICVSS2011 Selected Presentations


ICVSS 2011: International Computer Vision Summer School



Page 8: ICVSS2011 Selected Presentations

A Trillion Photos

Steve Seitz, University of Washington

Google

Sicily Computer Vision Summer School, July 11, 2011

Page 9: ICVSS2011 Selected Presentations

Facebook: >3 billion photos uploaded each month

~1 trillion photos taken each year

Page 10: ICVSS2011 Selected Presentations

What do you do with a trillion photos?

Digital Shoebox (hard drives, iPhoto, Facebook, ...)

Page 11: ICVSS2011 Selected Presentations
Page 12: ICVSS2011 Selected Presentations
Page 13: ICVSS2011 Selected Presentations
Page 14: ICVSS2011 Selected Presentations
Page 15: ICVSS2011 Selected Presentations
Page 16: ICVSS2011 Selected Presentations
Page 17: ICVSS2011 Selected Presentations

?

Page 18: ICVSS2011 Selected Presentations

Comparing images

Detect features using SIFT [Lowe, IJCV 2004]

Page 19: ICVSS2011 Selected Presentations

Comparing images

Extraordinarily robust image matching (see the sketch after this list)

– Across viewpoint (~60 degree out-of-plane rotations)

– Varying illumination

– Real-time implementations
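
A minimal sketch of this detect-and-match pipeline with OpenCV (assuming opencv-python 4.4+ and Lowe's ratio test; the image paths are placeholders, not files from the talk):

```python
# Minimal SIFT detect-and-match sketch (illustration only).
import cv2

img1 = cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("photo_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + 128-d descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to keep distinctive matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.75 * n.distance]

print(f"{len(kp1)} and {len(kp2)} keypoints, {len(good)} ratio-test matches")
```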

Page 20: ICVSS2011 Selected Presentations

Edges

Page 21: ICVSS2011 Selected Presentations

Scale Invariant Feature Transform

Adapted from slide by David Lowe

angle histogram over gradient orientations (0 to 2π)

Page 22: ICVSS2011 Selected Presentations

NASA Mars Rover images

Page 23: ICVSS2011 Selected Presentations

NASA Mars Rover images with SIFT feature matches (figure by Noah Snavely)

Page 24: ICVSS2011 Selected Presentations
Page 25: ICVSS2011 Selected Presentations

St. Peter's (inside)

Trevi Fountain

St. Peter's (outside)

Il Vittoriano

Colosseum (inside)

Colosseum (outside)

Forum

Page 26: ICVSS2011 Selected Presentations

Structure from motion

Matched photos → 3D structure

Page 27: ICVSS2011 Selected Presentations

Structure from motion

Cameras 1–3, with poses (R1, t1), (R2, t2), (R3, t3), observe 3D points p1, ..., p7.

minimize f(R, T, P) over all camera poses and 3D points

aka “bundle adjustment” (texts: Zisserman; Faugeras)
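
To make the objective concrete, a toy version of this joint minimization with SciPy: camera rotations, translations, and 3D points are refined together by minimizing reprojection error, assuming a simple pinhole model with known focal length and synthetic observations. This is a sketch of the idea, not the system shown in the talk:

```python
# Toy bundle adjustment: "minimize f(R, T, P)" as reprojection error.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points, rvecs, tvecs, focal=800.0):
    """Pinhole projection of each observed 3D point into its camera."""
    rotated = Rotation.from_rotvec(rvecs).apply(points) + tvecs
    return focal * rotated[:, :2] / rotated[:, 2:3]

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, observations):
    rvecs = params[: n_cams * 3].reshape(n_cams, 3)
    tvecs = params[n_cams * 3 : n_cams * 6].reshape(n_cams, 3)
    points = params[n_cams * 6 :].reshape(n_pts, 3)
    proj = project(points[pt_idx], rvecs[cam_idx], tvecs[cam_idx])
    return (proj - observations).ravel()

# Synthetic scene: 3 cameras, 7 points, every camera sees every point.
rng = np.random.default_rng(0)
n_cams, n_pts = 3, 7
true_pts = rng.normal(size=(n_pts, 3)) + [0, 0, 10]
true_r = 0.05 * rng.normal(size=(n_cams, 3))
true_t = rng.normal(size=(n_cams, 3)) * [1, 1, 0]
cam_idx = np.repeat(np.arange(n_cams), n_pts)
pt_idx = np.tile(np.arange(n_pts), n_cams)
obs = project(true_pts[pt_idx], true_r[cam_idx], true_t[cam_idx])

# Start from a perturbed guess and refine all poses and points jointly.
x0 = np.concatenate([true_r.ravel(), true_t.ravel(), true_pts.ravel()])
x0 = x0 + 0.01 * rng.normal(size=x0.shape)
result = least_squares(residuals, x0,
                       args=(n_cams, n_pts, cam_idx, pt_idx, obs))
print("final reprojection RMSE:", np.sqrt(np.mean(result.fun ** 2)))
```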

Page 28: ICVSS2011 Selected Presentations

?

Page 29: ICVSS2011 Selected Presentations

Reconstructing Rome in a day...

From ~1M images, using ~1000 cores

Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz
http://grail.cs.washington.edu/rome

Page 30: ICVSS2011 Selected Presentations

Rome 150K: Colosseum

Page 31: ICVSS2011 Selected Presentations

Rome: St. Peters

Page 32: ICVSS2011 Selected Presentations

Venice (250K images)

Page 33: ICVSS2011 Selected Presentations

Venice: Canal

Page 34: ICVSS2011 Selected Presentations

Dubrovnik

Page 35: ICVSS2011 Selected Presentations

Sparse output from the SfM system

From Sparse to Dense

Page 36: ICVSS2011 Selected Presentations

From Sparse to Dense

Furukawa, Curless, Seitz, Szeliski, CVPR 2010

Page 37: ICVSS2011 Selected Presentations
Page 38: ICVSS2011 Selected Presentations
Page 39: ICVSS2011 Selected Presentations
Page 40: ICVSS2011 Selected Presentations
Page 41: ICVSS2011 Selected Presentations

Most of our photos don’t look like this

Page 42: ICVSS2011 Selected Presentations
Page 43: ICVSS2011 Selected Presentations

recognition + alignment

Page 44: ICVSS2011 Selected Presentations

Your Life in 30 Seconds

path optimization

Page 45: ICVSS2011 Selected Presentations

Picasa Integration

• As "Face Movies" feature in v3.8

– Rahul Garg, Ira Kemelmacher

Page 46: ICVSS2011 Selected Presentations

Conclusion

trillions of photos + computer vision breakthroughs

= new ways to see the world


Page 48: ICVSS2011 Selected Presentations

Efficient Novel-Class Recognition and Search

Lorenzo Torresani

Page 49: ICVSS2011 Selected Presentations

Problem statement: novel object-class search

• Given: user-provided images of an object class + an image database (e.g., 1 million photos)
  - no text/tags available
  - query images may represent a novel class

• Want: the database images of this class

Page 50: ICVSS2011 Selected Presentations

Application: Web-powered visual search in unlabeled personal photos

Goal: find "soccer camp" pictures on my computer

1. Search the Web for images of "soccer camp"

2. Find images of this visual class on my computer

Page 51: ICVSS2011 Selected Presentations

Application: product search

• Search for aesthetic products

Page 52: ICVSS2011 Selected Presentations

Relation to other tasks

image retrieval | object categorization | novel class search

analogies:
- large databases
- efficient indexing
- compact representation

differences:
- simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)

Example figures: queries with their retrieved images, from [Nister and Stewenius, '07]; retrievals with a 32-bit RBM code vs. the 16384-bit Gist descriptor, with LabelMe-based label voting, from [Torralba et al., '08]; object retrieval on the Oxford 5K dataset (All Souls, Bridge of Sighs, Ashmolean Museum, Bodleian window), from [Philbin et al., '07].

Page 53: ICVSS2011 Selected Presentations

Relation to other tasks

image retrieval | object classification | novel class search

analogies (vs. image retrieval):
- large databases
- efficient indexing
- compact representation

differences:
- simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)

analogies (vs. object classification):
- recognition of object classes from a few examples

differences:
- classes to recognize are defined a priori
- training and recognition time is unimportant
- storage of features is not an issue

Page 54: ICVSS2011 Selected Presentations

Technical requirements of novel-class search

• The object classifier must be learned on the fly from few examples

• Recognition in the database must have low computational cost

• Image descriptors must be compact to allow storage in memory

Page 55: ICVSS2011 Selected Presentations

State-of-the-art in object classification

Winning recipe: many features + non-linear classifiers (e.g., [Gehler and Nowozin, CVPR'09])

!"#$%

&'()*+),%%-'.,()*+/%

#"0$%

...

!"#$%&#'()*+&,-)&.&#(#/*01#-2"#*

non-linear decision boundary

Page 56: ICVSS2011 Selected Presentations

Model evaluation on Caltech256

Plot: accuracy (%) vs. number of training examples for individual features (gist, phog, phog2pi, ssim, bow5000) with linear models.

Page 57: ICVSS2011 Selected Presentations

Model evaluation on Caltech256

Plot: accuracy (%) vs. number of training examples for individual features (gist, phog, phog2pi, ssim, bow5000) with linear models, plus their linear combination.

Page 58: ICVSS2011 Selected Presentations

!"#$%&'()*$+',

'"#*"-"*.%+'/$%0.&$1

!"#$%&'()*$+',/$%0.&$'2)(3"#%4)#

5)#6+"#$%&'()*$+',/$%0.&$'2)(3"#%4)#'7%898%8':.+4;+$'<$&#$+'

!$%&#"#=>'?@$A+$&'B'5)C)D"#E'FGH

Model evaluation on Caltech256

0 5 10 15 20 25 300

5

10

15

20

25

30

35

40

45

number of training examples

accu

racy

(%)

gistphogphog2pissimbow5000linear combinationnonlinear combination

Page 59: ICVSS2011 Selected Presentations

Multiple kernel combiners

Classification output is obtained by combining many features via non-linear kernels:

$$h(x) = \sum_{f=1}^{F} \beta_f \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b$$

(outer sum over features; inner sum over training examples)

!"#$%

&'()*+),%%-'.,()*+/%

#"0$%

...where

Page 60: ICVSS2011 Selected Presentations

Multiple kernel learning (MKL)

Learning a non-linear SVM by jointly optimizing over:

1. a linear combination of kernels: $k^*(x, x') = \sum_{f=1}^{F} \beta_f\, k_f(x, x')$

2. the SVM parameters: $\alpha \in \mathbb{R}^N$ and $b \in \mathbb{R}$

[Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]

$$\min_{\alpha, \beta, b}\;\; \frac{1}{2} \sum_{f=1}^{F} \beta_f\, \alpha^T K_f \alpha \;+\; C \sum_{n=1}^{N} L\!\left(y_n,\; b + \sum_{f=1}^{F} \beta_f\, K_f(x_n)^T \alpha\right)$$

$$\text{subject to}\;\; \sum_{f=1}^{F} \beta_f = 1, \quad \beta_f \ge 0, \; f = 1, \dots, F$$

where $L(y, t) = \max(0, 1 - y\,t)$ is the hinge loss and $K_f(x) = [k_f(x, x_1), k_f(x, x_2), \dots, k_f(x, x_N)]^T$.

Background (excerpt shown from [Gehler and Nowozin, CVPR'09]): each image feature $f_m$ is paired with a kernel $k_m(x, x') = k(f_m(x), f_m(x'))$; simple baselines combine the $F$ kernels by averaging, $k^*(x, x') = \frac{1}{F}\sum_m k_m(x, x')$, or by their geometric mean, $k^*(x, x') = (\prod_m k_m(x, x'))^{1/F}$, and then train a standard SVM, while MKL learns a sparse kernel combination ($\beta_m \ge 0$, $\sum_m \beta_m = 1$) jointly with the SVM parameters, solved e.g. with SILP or SimpleMKL.

Page 61: ICVSS2011 Selected Presentations

LP-β: a two-stage approach to MKL [Gehler and Nowozin, 2009]

• Classification output of traditional MKL:

$$h_{\mathrm{MKL}}(x) = \sum_{f=1}^{F} \beta_f \left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b \right)$$

• Classification function of LP-β:

$$h(x) = \sum_{f=1}^{F} \beta_f \underbrace{\left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \right)}_{h_f(x)}$$

Two-stage training procedure:

1. train each $h_f(x)$ independently → traditional SVM learning

2. optimize over $\beta$ → a simple linear program (a sketch of both stages follows)
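
A toy sketch of this two-stage procedure, under stated assumptions: synthetic data with two feature channels, scikit-learn RBF-kernel SVMs for stage 1, and a simplified hinge-loss linear program over β for stage 2 standing in for the exact LP of Gehler and Nowozin:

```python
# Sketch of LP-beta's two-stage training (simplified; not the authors' code).
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: two "feature channels" (e.g. a color and a shape descriptor).
n = 400
y = rng.choice([-1, 1], size=n)
feats = [y[:, None] * 1.0 + rng.normal(scale=2.0, size=(n, 5)),   # weak channel
         y[:, None] * 1.5 + rng.normal(scale=1.0, size=(n, 10))]  # stronger channel
idx_tr, idx_val = train_test_split(np.arange(n), test_size=0.5, random_state=0)

# Stage 1: train each per-feature SVM h_f(x) independently.
svms = [SVC(kernel="rbf", gamma="scale").fit(X[idx_tr], y[idx_tr]) for X in feats]
H = np.column_stack([clf.decision_function(X[idx_val]) for clf, X in zip(svms, feats)])
y_val, N, F = y[idx_val], H.shape[0], H.shape[1]

# Stage 2: linear program over beta (on the simplex) and slacks xi:
#   min sum(xi)/N  s.t.  xi_n >= 1 - y_n * sum_f beta_f H[n, f],  xi_n >= 0,
#                        sum_f beta_f = 1,  beta_f >= 0.
c = np.concatenate([np.zeros(F), np.ones(N) / N])
A_ub = np.hstack([-y_val[:, None] * H, -np.eye(N)])
b_ub = -np.ones(N)
A_eq = np.concatenate([np.ones(F), np.zeros(N)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (F + N), method="highs")
beta = res.x[:F]
print("learned beta:", np.round(beta, 3))
print("validation accuracy:", np.mean(np.sign(H @ beta) == y_val))
```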

Page 62: ICVSS2011 Selected Presentations

LP-β for novel-class search?

The LP-β classifier:

$$h(x) = \sum_{f=1}^{F} \beta_f \left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \right)$$

(outer sum over features; inner sum over training examples)

Unsuitable for our needs due to:

• large storage requirements (typically over 20K bytes/image)

• costly evaluation (requires query-time kernel distance computation for each test image)

• costly training (1+ minute for O(10) training examples)

Page 63: ICVSS2011 Selected Presentations

Classemes: a compact descriptor for efficient recognition [Torresani et al., 2010]

Key idea: represent each image in terms of its "closeness" to a set of basis classes ("classemes").

The c-th entry of the descriptor is the output of a pre-learned LP-β classifier for the c-th basis class (trained before the creation of the database):

$$\phi_c(x) = h^{\mathrm{classeme}}_c(x) = \sum_{f=1}^{F} \beta^c_f \left( \sum_{n=1}^{N} k_f(x, x^c_n)\,\alpha^c_n + b^c \right)$$

$$\Phi(x) = [\phi_1(x), \dots, \phi_C(x)]^T$$

Query-time learning: given training examples $\Phi(x_1), \dots, \Phi(x_N)$ of the novel class, train a linear classifier on $\Phi(x)$, e.g.

$$g_{\mathrm{duck}}(\Phi(x); w^{\mathrm{duck}}) = \Phi(x)^T w^{\mathrm{duck}} = \sum_{c=1}^{C} w^{\mathrm{duck}}_c\, \phi_c(x)$$

Only the linear weights $w^{\mathrm{duck}}$ are trained at query time; the classeme classifiers $\phi_c$ are trained before the creation of the database.
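
A sketch of this query-time search, with stand-ins named explicitly: a fixed random linear map plays the role of the pre-learned classeme classifiers (each φ_c is really an offline-trained LP-β model), the database descriptors are synthetic, and scikit-learn's LinearSVC is used as the query-time linear learner:

```python
# Query-time novel-class search on classeme vectors (illustrative sketch).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
C, D = 2659, 512            # number of classemes, dim of raw features x

W_classemes = rng.normal(size=(D, C)) / np.sqrt(D)   # stand-in for phi_1..phi_C
def classemes(X):
    """Phi(x): per-image vector of C basis-class scores."""
    return X @ W_classemes

database_x = rng.normal(size=(5_000, D))             # raw features of database
database_phi = classemes(database_x)                 # computed once, stored

# At query time: a handful of positive examples of the novel class plus
# some random database images as negatives, then a linear SVM on Phi(x).
query_x = rng.normal(size=(10, D)) + 0.5
negatives = database_phi[rng.choice(len(database_phi), 200, replace=False)]
X_train = np.vstack([classemes(query_x), negatives])
y_train = np.concatenate([np.ones(10), -np.ones(200)])

clf = LinearSVC(C=1.0).fit(X_train, y_train)
scores = database_phi @ clf.coef_.ravel() + clf.intercept_   # one dot product/image
top = np.argsort(-scores)[:25]
print("top-25 database indices:", top)
```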

Page 64: ICVSS2011 Selected Presentations

How this works...

• Accurate semantic labels are not required...

• Classeme classifiers are just used as detectors for specific patterns of texture, color, shape, etc.

Table 1 of [Torresani et al., 2010]: highly weighted classemes. Five classemes with the highest LP-β weights for the retrieval experiment, for a selection of Caltech 256 categories. Some may appear to make semantic sense, but the goal is simply to create a useful feature vector, not to assign semantic labels; the somewhat peculiar classeme labels reflect the ontology used as a source of base categories.

Page 65: ICVSS2011 Selected Presentations

Related work

• Attribute-based recognition:

Figures from [Farhadi et al., CVPR'09], "Describing Objects by their Attributes": attribute prediction within and across categories (a-Pascal / a-Yahoo), including reporting the absence of typical attributes (e.g., an aeroplane with no visible wing) and the presence of atypical ones (e.g., "skin" on a dining table).

Title and abstract of [Lampert et al., CVPR'09], "Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer": object classes with no training images are detected from a human-specified high-level attribute description (e.g., otter, polar bear, and zebra described by attributes such as black, white, brown, stripes, water, eats fish), evaluated on the "Animals with Attributes" dataset of over 30,000 animal images, 50 classes, and 85 semantic attributes.

requires hand-specified attribute-class associations

attribute classifiers must be trained with human-labeled examples

Page 66: ICVSS2011 Selected Presentations

Method overview

1. Classeme learning: train one classifier per basis class ("classeme"), e.g. φ_"body of water"(x), φ_"walking"(x), ..., and stack their outputs into the descriptor Φ(x) = [φ_1(x), ..., φ_C(x)].

2. Using the classemes for recognition and retrieval: given training examples of a novel class, represented by Φ(x_1), ..., Φ(x_N), learn a linear classifier on top of the classemes, e.g.

    g_duck(Φ(x)) = Σ_{c=1}^{C} w_c^duck φ_c(x)
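A minimal sketch of this two-stage idea (not the authors' code): the classeme classifiers are stand-ins (random linear scorers), the classeme descriptor Φ(x) stacks their outputs, and the novel class is learned with an off-the-shelf linear SVM. All names and sizes below are illustrative assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    C, D_feat = 2659, 512                          # number of classemes, raw feature size (illustrative)
    W_classemes = rng.normal(size=(C, D_feat))     # stand-in for the pre-trained classeme classifiers

    def classeme_descriptor(x):
        """Phi(x): the C classeme classifier outputs phi_c(x) for raw features x."""
        return W_classemes @ x

    # training examples of the novel class ("duck") and negatives,
    # represented only through their classeme descriptors Phi(x_1) ... Phi(x_N)
    X_raw = rng.normal(size=(40, D_feat))
    y = np.array([1] * 20 + [0] * 20)
    Phi = np.stack([classeme_descriptor(x) for x in X_raw])

    clf = LinearSVC(C=1.0).fit(Phi, y)             # g_duck(Phi(x)) = w_duck . Phi(x)
    score = clf.decision_function(classeme_descriptor(rng.normal(size=D_feat))[None, :])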

Page 67: ICVSS2011 Selected Presentations

Classeme learning: choosing the basis classes

• Classeme labels desiderata:

- must be visual concepts

- should span the entire space of visual classes

• Our selection: concepts defined in the Large Scale Ontology for Multimedia [LSCOM] to be “useful, observable and feasible for automatic detection”.

2659 classeme labels, after manual elimination of plurals, near-duplicates, and inappropriate concepts

Page 68: ICVSS2011 Selected Presentations

Classeme learning: gathering the training data

• We downloaded the top 150 images returned by Bing Images for each classeme label

• For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier

yes no

φ”walking”(x)

Page 69: ICVSS2011 Selected Presentations

Classeme learning: training the classifiers

• Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009]:

• We use 13 kernels based on spatial pyramid histograms computed from the following features:
  - color GIST [Oliva and Torralba, 2001]
  - oriented gradients [Dalal and Triggs, 2005]
  - self-similarity descriptors [Shechtman and Irani, 2007]
  - SIFT [Lowe, 2004]

    φ(x) = Σ_{f=1}^{F} β_f ( Σ_{n=1}^{N} k_f(x, x_n) α_{f,n} + b_f )

i.e. a linear combination of feature-specific SVMs.
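A small sketch of how such a combiner is evaluated, assuming RBF kernels as the per-feature kernels and illustrative parameter values (the actual kernels and learned parameters of the talk are not reproduced here):

    import numpy as np

    def rbf(x, Xn, gamma):
        # k_f(x, x_n) for all training points x_n of one feature channel
        return np.exp(-gamma * np.sum((Xn - x) ** 2, axis=1))

    def lp_beta_score(x_feats, train_feats, alphas, biases, betas, gammas):
        """phi(x) = sum_f beta_f * (sum_n k_f(x, x_n) alpha_{f,n} + b_f)."""
        score = 0.0
        for f, beta in enumerate(betas):
            k = rbf(x_feats[f], train_feats[f], gammas[f])
            score += beta * (k @ alphas[f] + biases[f])
        return score

    # toy usage with F=2 feature channels and N=5 training images
    rng = np.random.default_rng(0)
    F, N = 2, 5
    train_feats = [rng.normal(size=(N, 8)) for _ in range(F)]
    alphas = [rng.normal(size=N) for _ in range(F)]
    biases, betas, gammas = [0.1, -0.2], [0.6, 0.4], [0.5, 0.5]
    x_feats = [rng.normal(size=8) for _ in range(F)]
    print(lp_beta_score(x_feats, train_feats, alphas, biases, betas, gammas))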

Page 70: ICVSS2011 Selected Presentations

A dimensionality reduction view of classemes

x = [ GIST | oriented gradients | self-similarity descriptor | SIFT ]   →   Φ(x) = [ φ_1(x), ..., φ_2659(x) ]

• Raw features x: 23K bytes/image; non-linear kernels are needed for good classification.
• Classemes Φ(x): near state-of-the-art accuracy with linear classifiers; can be quantized down to <200 bytes/image with almost no recognition loss.

Page 71: ICVSS2011 Selected Presentations

Experiment 1: multiclass recognition on Caltech256

[Plot: accuracy (%) vs. number of training examples on Caltech256. Curves: LPbeta (LP-β in [Gehler & Nowozin, 2009] using 39 kernels), LPbeta13 (LP-β with our raw features x), MKL, Csvm (our approach: linear SVM with classemes Φ(x)), Cq1svm (linear SVM with binarized classemes, i.e. Φ(x) > 0), Xsvm (linear SVM with x).]

Page 72: ICVSS2011 Selected Presentations

Computational cost comparison

[Bar charts: training time in minutes and testing time in milliseconds for LPbeta vs. Csvm. Training LPbeta takes about 23 hours, training Csvm about 9 minutes.]

Page 73: ICVSS2011 Selected Presentations

Accuracy vs. compactness

[Plot: compactness (images per MB, log scale) vs. accuracy (%). Methods: LPbeta13, Csvm, Cq1svm, Xsvm, nbnn [Boiman et al., 2008], emk [Bo and Sminchisescu, 2008]. Lines link performance at 15 and 30 training examples; annotated storage costs are 188 bytes/image, 2.5K bytes/image, 23K bytes/image and 128K bytes/image.]

Page 74: ICVSS2011 Selected Presentations

Experiment 2: object class retrieval

[Fig. 4 from "Efficient Object Category Recognition Using Classemes": Retrieval. Precision@25 vs. number of training images for Csvm, Cq1Rocchio (α=1, β=0), Cq1Rocchio (α=0.75, β=0.15), Bowsvm, BowRocchio (α=1, β=0) and BowRocchio (α=0.75, β=0.15). Percentage of the top 25 in a 6400-document set which match the query class; random performance is 0.4%.]

We consider two different retrieval methods. The first method is a linear SVM learned for each of the Caltech classes using the one-vs-all strategy. We compare these classifiers to the Rocchio algorithm [15], a classic information retrieval technique for implementing relevance feedback. In order to use this method we represent each image as a document vector d(x). In the case of the BOW model, d(x) is the traditional tf-idf-weighted histogram of words. In the case of classemes instead, we define d(x)_i = [φ_i(x) > 0] · idf_i, i.e. d(x) is computed by multiplying the binarized classemes by their inverted document frequencies. Given a set of relevant training images D_r and a set of non-relevant examples D_nr, Rocchio's algorithm computes the document query

    q = α · (1/|D_r|) Σ_{x_r ∈ D_r} d(x_r)  −  β · (1/|D_nr|) Σ_{x_nr ∈ D_nr} d(x_nr)     (1)

where α and β are scalar values. The algorithm then retrieves the database documents having highest cosine similarity with this query. In our experiment, we set D_r to be the training examples of the class to retrieve, and D_nr to be the remaining training images. We report results for two different settings: (α, β) = (0.75, 0.15), and (α, β) = (1, 0), corresponding to the case where only positive feedback is used.

Figure 4 shows that methods using classemes consistently outperform the algorithms based on traditional BOW features. Furthermore, SVM yields much better precision than Rocchio's algorithm when using classemes. Note that these linear classifiers can be evaluated very efficiently even on large data sets; furthermore, they can also be trained efficiently and thus used in applications requiring fast query-time learning: for example, the average time required to learn a one-vs-all SVM using classemes is 674 ms when using 5 training examples from each Caltech class.

• Random performance is 0.4%
• Training Csvm takes 0.6 sec with 5*256 training examples
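A minimal sketch of the Rocchio relevance-feedback query described above, with binarized classemes weighted by inverse document frequency and cosine-similarity ranking. The data, the idf weighting details and the split into relevant/non-relevant examples are illustrative assumptions, not the experiment's actual setup.

    import numpy as np

    def rocchio_query(D_rel, D_nonrel, alpha=0.75, beta=0.15):
        """q = alpha * mean of relevant document vectors - beta * mean of non-relevant ones."""
        return alpha * D_rel.mean(axis=0) - beta * D_nonrel.mean(axis=0)

    def rank_by_cosine(q, docs):
        sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
        return np.argsort(-sims)

    # toy usage: d(x)_i = [Phi_i(x) > 0] * idf_i on stand-in classeme vectors
    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(100, 2659)) > 0.5            # stand-in binarized classemes
    idf = np.log(len(Phi) / (1 + Phi.sum(axis=0)))
    docs = Phi * idf
    q = rocchio_query(docs[:5], docs[5:10])             # 5 relevant, 5 non-relevant examples
    top25 = rank_by_cosine(q, docs)[:25]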

Page 75: ICVSS2011 Selected Presentations

Analogies with text retrieval

• Classeme representation of an image: presence/absence of visual attributes
• Bag-of-words representation of a text document: presence/absence of words

Page 76: ICVSS2011 Selected Presentations

Related work

• Prior work (e.g., [Sivic & Zisserman, 2003; Nister & Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words: detect interest patches, compute SIFT descriptors [Lowe, 2004], quantize the descriptors against a codebook, and represent the image as a sparse histogram of visual-word frequencies.

• To extend this methodology to object-class retrieval we need:
  - a representation more suited to object class recognition (e.g. classemes as opposed to bags of visual words)
  - to train the ranking/retrieval function for every new query class

Page 77: ICVSS2011 Selected Presentations

Data structures for efficient retrieval

Incidence matrix (documents × features f0 ... f7):

  I0: 1 0 1 0 0 1 0 0
  I1: 0 0 1 0 1 0 0 0
  I2: 1 1 0 1 0 0 0 0
  I3: 1 0 1 1 0 0 0 0
  I4: 1 0 0 0 1 0 1 0
  I5: 0 0 0 0 1 0 1 0
  I6: 1 0 0 0 0 1 0 1
  I7: 0 1 0 0 1 0 0 0
  I8: 1 1 0 0 0 1 0 0
  I9: 0 0 0 1 1 1 0 1

Inverted index (posting list per feature):

  f0: I0, I2, I3, I4, I6, I8
  f1: I2, I7, I8
  f2: I0, I1, I3
  f3: I2, I3, I9
  f4: I1, I4, I5, I7, I9
  f5: I0, I6, I8, I9
  f6: I4, I5
  f7: I6, I9

• very compact: only one bit per feature entry
• enables efficient calculation of w^T Φ for all Φ, as Σ_{i : Φ_i ≠ 0} w_i Φ_i
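A minimal sketch of this scoring scheme, using the incidence matrix and weight vector of these slides; the helper names are illustrative, not the paper's implementation.

    import numpy as np

    def build_inverted_index(Phi_bin):
        """Posting list per feature: the images whose binary classeme is 1."""
        return [np.flatnonzero(Phi_bin[:, i]) for i in range(Phi_bin.shape[1])]

    def score_with_inverted_index(w, index, n_images):
        scores = np.zeros(n_images)
        for i, wi in enumerate(w):
            if wi != 0:                       # only non-zero weights touch the index
                scores[index[i]] += wi        # add w_i to every image on feature i's list
        return scores

    # usage with the 10x8 incidence matrix of the slides
    Phi_bin = np.array([[1,0,1,0,0,1,0,0],[0,0,1,0,1,0,0,0],[1,1,0,1,0,0,0,0],
                        [1,0,1,1,0,0,0,0],[1,0,0,0,1,0,1,0],[0,0,0,0,1,0,1,0],
                        [1,0,0,0,0,1,0,1],[0,1,0,0,1,0,0,0],[1,1,0,0,0,1,0,0],
                        [0,0,0,1,1,1,0,1]])
    w = np.array([1.5, -2, 0, -5, 0, 3, -2, 0])
    index = build_inverted_index(Phi_bin)
    print(score_with_inverted_index(w, index, len(Phi_bin)))
    print(Phi_bin @ w)                        # same scores via dense inner products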

Page 78: ICVSS2011 Selected Presentations

Efficient retrieval via inverted index

Inverted index (as above):

  f0: I0, I2, I3, I4, I6, I8   f1: I2, I7, I8   f2: I0, I1, I3   f3: I2, I3, I9
  f4: I1, I4, I5, I7, I9   f5: I0, I6, I8, I9   f6: I4, I5   f7: I6, I9

w: [1.5  -2  0  -5  0  3  -2  0]

Goal: compute the score w^T Φ for every binary vector Φ in the database.

Page 79: ICVSS2011 Selected Presentations

Efficient retrieval via inverted index

Inverted index and w: [1.5  -2  0  -5  0  3  -2  0] (as above).

Scoring: for each feature f_i with w_i ≠ 0, add w_i to the running score of every image on f_i's posting list; the scores of I0 ... I9 are accumulated incrementally.

Page 80: ICVSS2011 Selected Presentations


Efficient retrieval via inverted index

Inverted index and w: [1.5  -2  0  -5  0  3  -2  0] (as above).

Cost of scoring is linear in the sum of the lengths of the inverted lists associated with non-zero weights.

Page 85: ICVSS2011 Selected Presentations

Improve efficiency via sparse weight vectors

Key-idea: force w to contain as many zeros as possible

Learning objective:

    E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n)

where R(w) is the regularizer, L is the loss function, Φ_n is the classeme vector of example n and y_n its label.

• L2-SVM: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 − y_n w^T Φ_n)

• Since |w_i| > w_i^2 for small w_i and |w_i| < w_i^2 for large w_i, choosing R(w) = Σ_i |w_i| will tend to produce a small number of larger weights and more zero weights.

[Figure (from a paper on ℓ1 wavelet-penalized tomographic inversion): because the ℓ1-ball has no bulge, the solution with smallest ℓ1-norm is sparser than the solution with smallest ℓ2-norm; a |w_i| penalty affects small coefficients more, and large coefficients less, than the traditional w_i^2 penalty.]

Page 86: ICVSS2011 Selected Presentations

Improve efficiency via sparse weight vectors

Key-idea: force w to contain as many zeros as possible

Learning objective (as above):

    E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n)

• L2-SVM: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 − y_n w^T Φ_n)

• L1-LR: R(w) = Σ_i |w_i|,  L(w; Φ_n, y_n) = log(1 + exp(−y_n w^T Φ_n))

• FGM (Feature Generating Machine) [Tan et al., 2010]: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 − y_n (w ⊙ d)^T Φ_n),  s.t. 1^T d ≤ B, d ∈ {0, 1}^D, where ⊙ is the elementwise product.
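A hedged sketch of the sparsity effect of the regularizer, comparing an L2-SVM and an L1-regularized logistic regression on the same stand-in data with scikit-learn (not the experimental setup of the talk):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 500))                  # stand-in classeme vectors
    y = (X[:, :5].sum(axis=1) > 0).astype(int)       # labels depend on a few dimensions

    l2_svm = LinearSVC(penalty="l2", C=1.0).fit(X, y)
    l1_lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

    print("non-zero weights, L2-SVM:", np.count_nonzero(l2_svm.coef_))
    print("non-zero weights, L1-LR :", np.count_nonzero(l1_lr.coef_))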

Page 87: ICVSS2011 Selected Presentations

Performance evaluation on ImageNet (10M images)

[Plot: Precision@10 (%) vs. search time per query (seconds). Curves: full inner product evaluation L2-SVM, full inner product evaluation L1-LR, inverted index L2-SVM, inverted index L1-LR.]

• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%

Each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n).

[Rastegari et al., 2011]

Page 88: ICVSS2011 Selected Presentations

Top-k ranking

• Do we need to rank the entire database? - users only care about the top-ranked images

• Key idea: - for each image iteratively update an upper-bound and a lower-bound on the score

- gradually prune images that cannot rank in the top-k

Page 89: ICVSS2011 Selected Presentations

Top-k pruning

w: [ 3 -2 0 -6 0 3 -2 0 ]

[Rastegari et al., 2011]

Incidence matrix as on the previous slides (images I0 ... I9 over features f0 ... f7).

• Highest possible score → initial upper bound: u* = w^T Φ^U for the binary vector Φ^U with Φ^U_i = 1 iff w_i > 0 (6 in this case).

• Lowest possible score → initial lower bound: l* = w^T Φ^L for the binary vector Φ^L with Φ^L_i = 1 iff w_i < 0 (−10 in this case).

Page 90: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w: [3  -2  0  -6  0  3  -2  0] as above.

• Initialization: for all images, upper bound u* and lower bound l*.

[Bar chart: per-image upper and lower bounds for I0 ... I9, all equal at initialization.]

Page 91: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w as above.

• Load feature i with w_i = +3 (> 0). For each image n:
  - subtract 3 from the upper bound if φ_{n,i} = 0
  - add 3 to the lower bound if φ_{n,i} = 1

Page 92: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w as above.

• Load feature i with w_i = −2 (< 0). For each image n:
  - decrement the upper bound by 2 if φ_{n,i} = 1
  - increment the lower bound by 2 if φ_{n,i} = 0

Page 93: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w as above.

• Load feature i with w_i = −6 (< 0). For each image n:
  - decrement the upper bound by 6 if φ_{n,i} = 1
  - increment the lower bound by 6 if φ_{n,i} = 0

Page 94: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

• Suppose k = 4: we can prune I2 and I9, since they cannot rank in the top-k.

Incidence matrix and w: [3  -2  0  -6  0  3  -2  0] as above.

[Bar chart: upper and lower bounds for I0 ... I9 after the updates.]
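A minimal sketch of the bound-update pruning loop sketched on these slides, run on the slides' own incidence matrix and weights; this is an illustration, not the implementation of [Rastegari et al., 2011].

    import numpy as np

    def top_k_pruning(Phi_bin, w, k):
        n = len(Phi_bin)
        upper = np.full(n, w[w > 0].sum())     # best case: all positive-weight features present
        lower = np.full(n, w[w < 0].sum())     # worst case: all negative-weight features present
        alive = np.ones(n, dtype=bool)
        for i in np.argsort(-np.abs(w)):       # visit features by decreasing |w_i|
            if w[i] == 0:
                break
            has = Phi_bin[:, i].astype(bool)
            if w[i] > 0:
                lower[has] += w[i]             # confirmed positive contribution
                upper[~has] -= w[i]            # missed positive contribution
            else:
                upper[has] += w[i]             # confirmed negative contribution
                lower[~has] -= w[i]            # avoided negative contribution
            # prune images whose upper bound falls below the k-th best lower bound
            kth_lower = np.sort(lower[alive])[-min(k, alive.sum())]
            alive &= upper >= kth_lower
        return alive, lower                    # lower equals the exact scores at the end

    Phi_bin = np.array([[1,0,1,0,0,1,0,0],[0,0,1,0,1,0,0,0],[1,1,0,1,0,0,0,0],
                        [1,0,1,1,0,0,0,0],[1,0,0,0,1,0,1,0],[0,0,0,0,1,0,1,0],
                        [1,0,0,0,0,1,0,1],[0,1,0,0,1,0,0,0],[1,1,0,0,0,1,0,0],
                        [0,0,0,1,1,1,0,1]])
    w = np.array([3, -2, 0, -6, 0, 3, -2, 0])
    alive, scores = top_k_pruning(Phi_bin, w, k=4)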

Page 95: ICVSS2011 Selected Presentations

Distribution of weights and pruning rate

[Figure 2 from the ICCV 2011 submission #1745 (cited on these slides as [Rastegari et al., 2011]): (a) distribution of weight absolute values for different classifiers (L1-LR, L2-SVM, FGM), after sorting the weight magnitudes; TkP runs faster with sparse, highly skewed weight values. (b) Pruning rate of TkP for the various classification models and different values of k (k = 10, 3000). Slide axes: features considered in descending order of |w_i| (x) vs. normalized absolute weight values (y).]

Page 96: ICVSS2011 Selected Presentations

Performance evaluation on ImageNet (10M images)

[Plot: Precision@10 (%) vs. search time per query (seconds). Curves: TkP L1-LR, TkP L2-SVM, inverted index L1-LR, inverted index L2-SVM.]

• k = 10
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%

Each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n).

[Rastegari et al., 2011]

Page 97: ICVSS2011 Selected Presentations

Alternative search strategy: approximate ranking

• Key idea: approximate the score function with a measure that can be computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008]).

• Approximate ranking via vector quantization:

    w^T Φ ≈ w^T q(Φ)

  where q(·) is a quantizer returning the cluster centroid nearest to Φ.

• Problem:
  - to approximate the score well we need a fine quantization
  - the dimensionality of our space is D = 2659: too large to enable a fine quantization using k-means clustering

Page 98: ICVSS2011 Selected Presentations

Product quantization [Jegou et al., 2011]

• Split the feature vector Φ into v subvectors: Φ ≈ [ Φ_1 | Φ_2 | ... | Φ_v ]

• The subvectors are quantized separately, q(Φ) = [ q_1(Φ_1) | q_2(Φ_2) | ... | q_v(Φ_v) ], where each q_j(·) is learned by k-means in a space of dimensionality D/v with a limited number of centroids.

• Example from [Jegou et al., 2011]: a 128-dimensional vector is split into 8 subvectors of dimension 16; with 2^8 = 256 centroids per subvector (8 bits each), the whole vector is encoded by a 64-bit quantization index.
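A minimal sketch of product quantization with per-sub-block k-means (scikit-learn); the data, block count and centroid count are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_pq(X, v, r):
        """Split D-dim vectors into v sub-blocks and learn r centroids per block."""
        d = X.shape[1] // v
        return [KMeans(n_clusters=r, n_init=4, random_state=0).fit(X[:, j*d:(j+1)*d])
                for j in range(v)]

    def pq_encode(X, quantizers):
        """Code = one centroid index per sub-block (v small integers per vector)."""
        d = X.shape[1] // len(quantizers)
        return np.stack([q.predict(X[:, j*d:(j+1)*d])
                         for j, q in enumerate(quantizers)], axis=1)

    # toy usage: 128-dim vectors, v=8 sub-blocks, r=16 centroids per block
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 128))
    quantizers = train_pq(X, v=8, r=16)
    codes = pq_encode(X, quantizers)      # shape (1000, 8), each entry in [0, 16)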

Page 99: ICVSS2011 Selected Presentations

Efficient approximate scoring

    w^T Φ ≈ w^T q(Φ) = Σ_{j=1}^{v} w_j^T q_j(Φ_j)

1. Fill a look-up table: the inner products s_{jk} = w_j^T c_{jk} between each weight sub-block w_j (j = 1, ..., v) and each of the r centroids c_{jk} of sub-block j can be precomputed and stored in a v × r table.

2. Score each quantized vector q(Φ) in the database using the look-up table:

    w^T q(Φ) = w_1^T q_1(Φ_1) + w_2^T q_2(Φ_2) + ... + w_v^T q_v(Φ_v)

Only v additions per image!
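A minimal sketch of the two steps above; the codebooks and PQ codes here are random stand-ins (e.g. produced by the product-quantization sketch earlier), and all names and sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    v, r, d = 8, 256, 16                          # sub-blocks, centroids per block, sub-block dim
    centroids = rng.normal(size=(v, r, d))        # stand-in PQ codebooks
    codes = rng.integers(0, r, size=(10000, v))   # PQ codes of the database images
    w = rng.normal(size=v * d)                    # linear classifier for the query class

    # 1. fill the look-up table: s[j, k] = w_j . c_{j,k}
    table = np.einsum('jkd,jd->jk', centroids, w.reshape(v, d))

    # 2. approximate score w^T q(Phi) for every image: v look-ups and additions each
    scores = sum(table[j, codes[:, j]] for j in range(v))
    top10 = np.argsort(-scores)[:10]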

Page 105: ICVSS2011 Selected Presentations

Choice of parameters! [Rastegari et al., 2011]

• Dimensionality is first reduced with PCA from D=2659 to D’ < D

• How do we choose D’, v (number of sub-blocks), r (number of centroids per sub-block)?

• Effect of parameter choices on a database of 150K images:

[Plot: Precision@10 (%) vs. search time per query (seconds) for D' ∈ {128, 256, 512} and various (v, r) settings, e.g. (16, 2^8), (32, 2^8), (64, 2^6), (128, 2^8), (256, 2^6), (256, 2^8).]

Page 106: ICVSS2011 Selected Presentations

[Figure 1 from the ICCV 2011 submission #1745 ([Rastegari et al., 2011]): class-retrieval precision versus search time for the ILSVRC2010 data set. The x-axis is search time; the y-axis shows the percentage of true positives ranked in the top 10 using a database of 150,000 images (149,850 distractors and 150 true positives per query class). Curves: AR L2-SVM, TkP L1-LR, TkP L2-SVM, TkP FGM; each curve is obtained by varying the parameters controlling the accuracy-speed tradeoff.]

Performance evaluation on 150K images

• Performance averaged over 1000 object classes used as queries

• 50 training examples per query class

• Database includes 150 images of the query class and 150K images of other classes

• Prec@10 of a random classifiers is 0.1%

approximate ranking

Page 107: ICVSS2011 Selected Presentations

Memory requirements for 10M images

[Bar chart of index memory usage: inverted index ≈ 9 GB, incidence matrix (used by TkP) ≈ 3 GB, product quantization index ≈ 1.8 GB.]

Page 108: ICVSS2011 Selected Presentations

Conclusions and open questions

• Compact descriptor enabling efficient novel-class recognition (less than 200 bytes/image, yet it produces performance similar to MKL at a tiny fraction of the cost)

• Questions currently under investigation:
  - can we learn better classemes from fully-labeled data?
  - can we decouple the descriptor size from the number of classeme classes?
  - can we encode spatial information ([Li et al., NIPS10])?

• Software for classeme extraction available at: http://vlg.cs.dartmouth.edu/projects/classemes_extractor/

Classemes:

Information retrieval approaches to large-scale object-class search:

• sparse representations and retrieval models

• top-k ranking

• approximate scoring

Page 109: ICVSS2011 Selected Presentations

ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg

Outline

1 ICVSS 2011

2 A Trillion Photos - Steven Seitz

3 Efficient Novel Class Recognition and Search - LorenzoTorresani

4 The Life of Structured Learned Dictionaries - Guillermo Sapiro

5 Image Rearrangement & Video Synopsis - Shmuel Peleg

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

Page 110: ICVSS2011 Selected Presentations

The Life of Structured Learned Dictionaries

Guillermo SapiroUniversity of Minnesota

G. Yu and S. Mallat (Inverse problems via GMM)G. Yu and F. Leger (Matrix completion)G. Yu (Statistical compressed sensing)

A. Castrodad (activity recognition in video)M. Zhou, D. Dunson, and L. Carin (video layers separation)

1

Friday, July 8, 2011

Page 111: ICVSS2011 Selected Presentations

Examples: zooming, inpainting, deblurring

Inverse problems: y = U f + w, with w ∼ N(0, σ² Id), where U is a masking (inpainting), subsampling (zooming) or convolution (deblurring) operator.

Friday, July 8, 2011

Page 112: ICVSS2011 Selected Presentations

Learned Overcomplete Dictionaries

• Dictionary learning:

    min_{D, {a_i}_{1≤i≤I}}  Σ_{1≤i≤I} ( ||f_i − D a_i||² + λ ||a_i||_1 )

• Better performance than pre-fixed dictionaries.
• Non-convex.
• High computational complexity.
• Huge numbers of parameters to estimate.
• Behavior not well understood (results starting to appear).
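A hedged sketch of this objective using scikit-learn's alternating minimization on stand-in patch data (not the presenter's code; the patch size, number of atoms and λ below are assumptions):

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    # min_{D, a_i}  sum_i ||f_i - D a_i||^2 + lambda ||a_i||_1, here on 8x8 patches
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(5000, 64))          # f_i: flattened 8x8 patches (stand-in data)

    learner = MiniBatchDictionaryLearning(n_components=256, alpha=1.0,
                                          batch_size=200, random_state=0)
    codes = learner.fit_transform(patches)         # sparse coefficients a_i
    D = learner.components_                        # learned dictionary atoms (rows)

    print("atoms:", D.shape, "avg non-zeros per patch:", np.count_nonzero(codes, axis=1).mean())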

Friday, July 8, 2011

Page 113: ICVSS2011 Selected Presentations

Sparse Inverse Problem Estimation

• Observation: y = U f + w, where w ∼ N(0, σ² Id).

• Sparse prior: the dictionary D = {φ_m}_{m∈Γ} provides a sparse representation for f,

    f = D a + ε_Λ,  with Λ = support(a), |Λ| ≪ |Γ| and ||ε_Λ||_2 ≪ ||f||_2.

• Then the transformed dictionary U D = {U φ_m}_{m∈Γ} provides a sparse representation for y,

    y = U D a + ε'_Λ,  with Λ = support(a), |Λ| ≪ |Γ| and ||ε'||_2 ≪ ||y||_2.

• Sparse inverse problem estimation: estimate a from y,

    â = argmin_a ||U D a − y||² + λ ||a||_1,   then   f̂ = D â.

Friday, July 8, 2011
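A minimal sketch of this sparse estimate with a random masking operator and a stand-in dictionary, solved with scikit-learn's Lasso (whose internal scaling of λ differs slightly from the formula above); everything below is illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    N, K = 64, 128                                 # signal size, number of atoms
    D = rng.normal(size=(N, K)); D /= np.linalg.norm(D, axis=0)   # stand-in dictionary
    a_true = np.zeros(K); a_true[rng.choice(K, 5, replace=False)] = rng.normal(size=5)
    f = D @ a_true                                 # sparse ground-truth signal

    keep = rng.random(N) < 0.5                     # U: random masking (inpainting-style)
    U = np.eye(N)[keep]
    y = U @ f + 0.01 * rng.normal(size=keep.sum())     # y = U f + w

    # a_hat = argmin ||U D a - y||^2 + lambda ||a||_1 ;  f_hat = D a_hat
    lasso = Lasso(alpha=0.01, max_iter=10000).fit(U @ D, y)
    f_hat = D @ lasso.coef_
    print("relative reconstruction error:", np.linalg.norm(f_hat - f) / np.linalg.norm(f))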

Page 114: ICVSS2011 Selected Presentations

Structured Representation and Estimation

Overcomplete dictionary D vs. structured overcomplete dictionary B_1, B_2, ..., B_K

• Dictionary: union of PCAs, D = {B_k}_{1≤k≤K}
• Union of orthogonal bases
• In each basis, the atoms are ordered by decreasing eigenvalue: λ^k_1 ≥ λ^k_2 ≥ ... ≥ λ^k_N

• Piecewise linear estimation (PLE)
  - a linear estimator per basis
  - non-linear basis selection: a best linear estimator is selected
  - small degree of freedom, fast computation, state-of-the-art performance

Friday, July 8, 2011

Page 115: ICVSS2011 Selected Presentations

Gaussian Mixture Models

Observation model: y_i = U_i f_i + w_i, where w_i ∼ N(0, σ² Id), for observations {y_i}_{1≤i≤I}.

• Estimate the Gaussians {(µ_k, Σ_k)}_{1≤k≤K}
• Identify the Gaussian k_i that generates f_i, ∀i
• Estimate f_i from y_i and N(µ_{k_i}, Σ_{k_i}), ∀i

Friday, July 8, 2011

Page 116: ICVSS2011 Selected Presentations

Structured Sparsity

• PCA (Principal Component Analysis): Σ_k = B_k S_k B_k^T, with S_k = diag(λ^k_1, ..., λ^k_N) and eigenvalues λ^k_1 ≥ λ^k_2 ≥ ... ≥ λ^k_N.
• PCA basis B_k = {φ^k_m}_{1≤m≤N}, orthogonal.
• PCA transform: f^k_i = B_k a^k_i.

• MAP with PCA:

    f̂^k_i = argmin_{f_i} ( ||U_i f_i − y_i||² + σ² f_i^T Σ_k^{-1} f_i )

  or, equivalently, in the coefficient domain,

    â^k_i = argmin_{a_i} ( ||U_i B_k a_i − y_i||² + σ² Σ_{m=1}^{N} |a_i[m]|² / λ^k_m )
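A minimal sketch of the per-basis linear MAP estimator and the non-linear basis selection, assuming zero-mean Gaussians as on this slide and selecting the basis by the smallest objective value (the log-determinant term of the full MAP-EM criterion is omitted); covariances, mask and noise level are stand-ins.

    import numpy as np

    def ple_estimate(y, U, Sigmas, sigma2):
        """For each model Sigma_k, compute f_k = (U^T U + sigma^2 Sigma_k^{-1})^{-1} U^T y
        (the closed form of the MAP objective above), then pick the best basis."""
        best = None
        for k, Sigma in enumerate(Sigmas):
            Sigma_inv = np.linalg.inv(Sigma)
            f_k = np.linalg.solve(U.T @ U + sigma2 * Sigma_inv, U.T @ y)
            obj = np.sum((U @ f_k - y) ** 2) + sigma2 * f_k @ Sigma_inv @ f_k
            if best is None or obj < best[0]:
                best = (obj, k, f_k)
        return best[1], best[2]                   # selected basis index, estimated patch

    # toy usage: two stand-in covariances, half of the pixels observed
    rng = np.random.default_rng(0)
    N = 16
    Sigmas = [np.diag(np.linspace(1.0, 0.01, N)),
              np.diag(np.linspace(0.01, 1.0, N))]
    f = rng.multivariate_normal(np.zeros(N), Sigmas[0])
    U = np.eye(N)[rng.permutation(N)[:8]]         # masking operator: keep 8 of 16 samples
    y = U @ f + 0.05 * rng.normal(size=8)
    k, f_hat = ple_estimate(y, U, Sigmas, sigma2=0.05 ** 2)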

Friday, July 8, 2011

Page 117: ICVSS2011 Selected Presentations

Structured Sparsity

Piecewise linear estimate

    â^k_i = argmin_{a_i} ( ||U_i B_k a_i − y_i||² + σ² Σ_{m=1}^{N} |a_i[m]|² / λ^k_m )

vs. sparse estimate

    â_i = argmin_{a_i} ( ||U D a_i − y_i||² + λ Σ_{m=1}^{|Γ|} |a_i[m]| )

• Non-linear basis selection over D = {B_1, ..., B_K}: degree of freedom K, compared with the full degree of freedom C(|Γ|, |Λ|) of unconstrained atom selection.
• Linear collaborative filtering in each basis.

Friday, July 8, 2011

Page 118: ICVSS2011 Selected Presentations

Initial Experiments: Evolution

24

Clustering 1st iteration

Clustering 2nd iteration

Friday, July 8, 2011

Page 119: ICVSS2011 Selected Presentations

Experiments: Inpainting

Original 20% available MCA 24.18 dB ASR 21.84 dB

KR 21.55 dB FOE 21.92 dB BP 25.54 dB PLE 27.65 dB

[Elad, Starck, Querre, Donoho, 05] [Guleryuz, 06]

[Takeda, Farsiu. Milanfar, 06] [Roth and Black, 09] [Zhou, Sapiro, Carin, 10]26

Friday, July 8, 2011

Page 120: ICVSS2011 Selected Presentations

Experiments: Zooming

Original Bicubic 28.47 dB SAI 30.32 dB PLE 30.64 dBSR 23.85 dB

Low-resolution

SAI [Zhang and Wu, 08]SR [Yang, Wright, Huang, Ma, 09]

29

Friday, July 8, 2011

Page 121: ICVSS2011 Selected Presentations

Experiments: Zooming Deblurring

f Uf y = SUf

Iy PLE 30.49 dB SR 28.93 dB29.40 dB

[Yang, Wright, Huang, Ma, 09]

32

Friday, July 8, 2011

Page 122: ICVSS2011 Selected Presentations

Experiments: Denoising

34

Original Noisy 22.10 dB NLmeans 28.42 dB

FOE 25.62 dB BM3D 30.97 dB PLE 31.00 dB

[Buades et al, 06]

[Roth and Black, 09] [Dabov et al, 07]

Friday, July 8, 2011

Page 123: ICVSS2011 Selected Presentations

Summary of this part

• Gaussian mixture models and MAP-EM work well for image inverse problems.

• Piecewise linear estimation, connection to structured sparsity.

• Nonlinear best basis selection, small degree of freedom.

• Faster computation than sparse estimation.

• Results in the same ballpark of the state-of-the-art.

• Beyond images: recommender systems and audio (Sprechmann & Cancela)

• Statistical compressed sensing38

• Collaborative linear filtering.

Friday, July 8, 2011

Page 124: ICVSS2011 Selected Presentations

Modeling and Learning Human Activity

Alexey Castrodad(1,2) and Guillermo Sapiro(2)
(1) NGA Basic and Applied Research
(2) University of Minnesota, ECE Department
[email protected], [email protected]

Page 125: ICVSS2011 Selected Presentations

Motivation

• Problem: given volumes of video feed, detect activities of interest
  - mostly done manually!
• Solving this will:
  - aid the operator: surveillance/security, gaming, psychological research
  - sift through large amounts of data
• Solution: fully/semi-automatic activity detection with minimum human interaction
  - invariance to spatial transformations
  - robust to occlusions, low resolution, noise
  - fast and accurate
  - simple, generic

Page 126: ICVSS2011 Selected Presentations

Sparse modeling: Dictionary learning from data

7  

Page 127: ICVSS2011 Selected Presentations

Sparse modeling for action classification: Phase 1

• Input videos
• Spatio-temporal features
• Sparse modeling (per-class dictionaries D1, D2, D3 → D)
• l1 pooling (A1, A2, A3)

[Block diagram: training (Class 1, Class 2, Class 3) → per-class dictionaries; new video → feature extraction → sparse coding → classification → classifier output.]

Page 128: ICVSS2011 Selected Presentations

Sparse modeling for action classification: Phase 2

• Sparse modeling (per-class dictionaries D1, D2, D3 → D)
• Inter-class modeling (dictionaries E1, E2, E3)
• l1 pooling (A1, A2, A3)

[Block diagram: training videos → feature extraction → sparse coding (training); new video → feature extraction → sparse coding → sparse coding against E1, E2, E3 → classification, combined with the classifier output from Phase 1.]

Page 129: ICVSS2011 Selected Presentations

Results

• YouTube Action Dataset
  - variable spatial resolution videos, 3-8 seconds each
  - 11 types of actions from YouTube videos

  Scene: indoors/outdoors
  Actions: basketball shooting, cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, walking with a dog
  Camera: jitter, scale variations, camera motion, variable illumination conditions, high background clutter
  Resolution: variable, resampled to 320 x 240
  Frame rate: 25 fps

Page 130: ICVSS2011 Selected Presentations

Results: YouTube Action Dataset

• Best/recent reported: 75.8% (Q.V. Le et al., 2011); 84.2% (Wang et al., 2011)
• Recognition rate: 80.29% (phase 1) and 91.9% (phase 2)

Page 131: ICVSS2011 Selected Presentations

Conclusion

• Main contribution:
  - robust activity recognition framework based on sparse modeling
  - generic: works on multiple data sources
  - state-of-the-art results on all of them, with the same parameters
• Key advantages:
  - simplicity, state-of-the-art results
  - fast and accurate: 7.5 fps
  - 7 frames needed for detection
• Future directions:
  - exploit human interactions
  - infer the actions
  - foreground extraction/video analysis for activity clustering

Page 132: ICVSS2011 Selected Presentations

ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg

Outline

1 ICVSS 2011

2 A Trillion Photos - Steven Seitz

3 Efficient Novel Class Recognition and Search - LorenzoTorresani

4 The Life of Structured Learned Dictionaries - Guillermo Sapiro

5 Image Rearrangement & Video Synopsis - Shmuel Peleg

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

Page 133: ICVSS2011 Selected Presentations

Shift-Map Image Editing

Yael Pritch, Eitam Kav-Venaki, Shmuel Peleg

The Hebrew University of Jerusalem

Page 134: ICVSS2011 Selected Presentations
Page 135: ICVSS2011 Selected Presentations

Retargeting (Avidan and Shamir SIGGRAPH’07, Wolf et al., ICCV’07, Wang et al., SIGASIA’08, Rubinstein et al., SIGGRAPH’08, Rubinstein et al.,SIGGRAPH’09)

Input

Geometrical Image Editing: Retargeting

Shift-Map Output

Page 136: ICVSS2011 Selected Presentations

Geometrical Image Editing:Inpainting

Inpainting (Criminisi et al. CVPR’03, Wexler et al. CVPR’04,Sun, et al. SIGGRAPH’05, Komodakis et al. CVPR’06, Hays and Efros, SIGGRAPH’07)

InputMask

Page 137: ICVSS2011 Selected Presentations

OutputMask

Geometrical Image Editing:Inpainting

Inpainting (Criminisi et al. CVPR’03, Wexler et al. CVPR’04,Sun, et al. SIGGRAPH’05, Komodakis et al. CVPR’06, Hays and Efros, SIGGRAPH’07)

Page 138: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

Page 139: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A

Page 140: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A B

Page 141: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A BC

Page 142: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A B DCNo accuratesegmentation

required

Page 143: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

No accuratesegmentation

required

Page 144: ICVSS2011 Selected Presentations

• Shift-Maps represent a mapping for each pixel in the output image into the input image

• The color of the output pixel is copied from corresponding input pixel

Our Approach : Shift-Map

Output : R(u,v) Input : I(x,y)

Page 145: ICVSS2011 Selected Presentations

• We use relative mapping coordinate (like in Optical Flow)

Our Approach : Shift-Map

(u,v)

(x,y)

Output : R(u,v) Input : I(x,y)

(u,v)

Page 146: ICVSS2011 Selected Presentations

Our Approach: Shift-Map

[Diagram: input image, shift-map (horizontal and vertical shifts, e.g. Tx = 0, Tx = 50, Tx = 400, Ty = 10) and output image.]

• Minimal distortion
• Adaptive boundaries
• Fast optimization

Page 147: ICVSS2011 Selected Presentations

Geometric Editing as an Energy Minimization

• We look for the optimal mapping, which can be described as an energy minimization problem:
  - data term (computed for each pixel): external editing requirements
  - smoothness term (computed for each pair of neighboring pixels): avoid stitching artifacts

• Unified representation for geometric editing applications
• Solved using a graph labeling algorithm
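Written out, the energy described above has the following generic form (a schematic sketch consistent with the slide; the weighting factor α and the neighborhood notation N are assumptions, and the concrete smoothness term of the paper appears on the next slide):

    E(M) = \sum_{p \in R} E_d\big(M(p)\big) \;+\; \alpha \sum_{(p,q) \in \mathcal{N}} E_s\big(M(p), M(q)\big)

where M(p) = (t_x, t_y) is the shift assigned to output pixel p, E_d encodes the external editing requirements, and E_s penalizes stitching artifacts between neighboring output pixels.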

Page 148: ICVSS2011 Selected Presentations

The Smoothness Term

[Diagram: a discontinuity in the shift-map between neighboring output pixels p and q is penalized by comparing the colors and gradients of the corresponding input pixels p', q' and their neighbors n_p', n_q' (R: output image, I: input image).]

(Kwatra et al. 03, Agarwala et al. 04)

Page 149: ICVSS2011 Selected Presentations

• Data term varies between different application• Inpainting data term uses data mask D(x,y) over the

input image– D(x,y)= ∞ for pixels to be removed– D(x,y)=0 elsewhere

• Specific input pixels can be forced not to be included in the output image by setting D(x,y)=∞

The Data Term: Inpainting

(x,y)

0=D

(u,v)

Page 150: ICVSS2011 Selected Presentations

• Input pixels can be forced to appear in a new location

(u,v)

(x,y)

• Appropriate shift gets infinitely low energy

• Other shifts getinfinitely high energy

The Data Term: Rearrangement

Page 151: ICVSS2011 Selected Presentations

• Use picture borders• Can incorporate importance mask

– Order constraint on mapping is applied to prevent duplications of important areas

The Data Term: Retargeting

Page 152: ICVSS2011 Selected Presentations

• Minimal energy mapping can be represented as graph labeling where the Shift-Map value is the selected labelfor each output pixel

• Labels: relative shift

Shift-Map as Graph Labeling

Output image pixels Input image

Shift Map:assign

a label to each pixel

Nodes:pixels

Labels: shift-map values (tx,ty)

Page 153: ICVSS2011 Selected Presentations

Hierarchical SolutionGaussian pyramid

on input

Shift-Map

Shift-Map

Output

Page 154: ICVSS2011 Selected Presentations

Shift-Map handles, without additional user interaction, some cases that other algorithms have suggested can only be handled with additional user guidance.

Results and Comparison

J. Sun, L. Yuan, J. Jia, and H. Shum. Image completion with structure propagation. In SIGGRAPH’05

Shift-Map

Image completion with structure propagation [Sun et al. SIGGRAPH’05]

Mask

Page 155: ICVSS2011 Selected Presentations

Application: Retargeting

Input Output

Page 156: ICVSS2011 Selected Presentations

Results and Comparison

Non-Homogeneous [Wolf et al., ICCV'07], PatchMatch [Barnes et al., SIGGRAPH'09], Improved Seam Carving [Rubinstein et al., SIGGRAPH'08], Shift-Maps

Page 157: ICVSS2011 Selected Presentations

Summary

• New representation to geometrical editing applications as an optimal graph labeling

• Unified approach

• Solved efficiently using hierarchical approximations

• Minimal user interaction is required for various editing tasks

Page 158: ICVSS2011 Selected Presentations

• Build an Output image R from pixels taken from Source image I such that R is most similar to Target image T

Similarity Guided Composition

Source ImageTarget Image Output

Page 159: ICVSS2011 Selected Presentations

• Data term reflects a similarity between the output image R and a target image T

• Similarity uses both colors and gradients

Similarity Guided Composition

Page 160: ICVSS2011 Selected Presentations

• Data term indicates the similarity of the output image to the target image

• Weight between similarity and smoothness has the following effect

Source ImageTarget Image

ResultedOutput

Previous Work: Efros and Freeman 2001, Hertzman et al. 2001

Similarity Guided Composition

Page 161: ICVSS2011 Selected Presentations

Edge Preserving Magnification

Using the original image as the source, similarity guided composition can magnify

Does not work for gradual color changes

Source Target (bilinear magnification)

Result

Page 162: ICVSS2011 Selected Presentations

Edge Preserving Magnification

Original image can be the source for edge areas. Otherwise the magnified image is the source.

Source 1Magnified Target

Source 2Original Edge Map

Page 163: ICVSS2011 Selected Presentations

Edge Preserving Magnification

Bicubic Shift Map

Page 164: ICVSS2011 Selected Presentations

The Bidirectional Similarity [Simakov, Caspi, Shechtman, Irani – CVPR'2008]

Is it easy to compose (recover) the source from the target, and easy to compose (recover) the target from the source?

• Completeness: all source patches (at multiple scales) should be in the target
• Coherence: all target patches (at multiple scales) should be in the source

Page 165: ICVSS2011 Selected Presentations

• It will be hard to reconstruct the fish when composing the source back from the retargeted result
• Shift-Map retargeting maximizes the coherence

Shift-Map Retargeting with Feedback

Page 166: ICVSS2011 Selected Presentations

• Increase the Appearance Data Term of input regions with a high Composition Score E<A|B> and recompute the output B.

• Pixels with the higher Appearance Term will now appear in the output and increase the completeness.

Shift-Map Retargeting with Feedback

Page 167: ICVSS2011 Selected Presentations

Original

Retargeted Reconstruction of Original

Appearance Term

E<A|B>E<A|B>

Page 168: ICVSS2011 Selected Presentations

FeedbackOriginal Shift-Map

Shift-Map Retargeting with Feedback

Page 169: ICVSS2011 Selected Presentations

Video Synopsis and Indexing: Making a Long Video Short

• 11 million cameras in 2008• Expected 30 million in 2013• Recording 24 hours a day, every day

Page 170: ICVSS2011 Selected Presentations

t

Video Synopsis: Shift Objects in Time

Input video I(x,y,t)  →  Synopsis video S(x,y,t)

Page 171: ICVSS2011 Selected Presentations

Steps in Video Synopsis

• Detect and track objects, store in a database.
• Select relevant objects from the database.
• Display selected objects in a very short "Video Synopsis".
• In the "Video Synopsis", objects from different times can appear simultaneously.
• Index from selected objects into the original video.
• Cluster similar objects.

Page 172: ICVSS2011 Selected Presentations

Two ClustersCars

People

Camera in St. Petersburg

• Detect specific events• Discover activity patterns

Page 173: ICVSS2011 Selected Presentations

ICVSS 2011 Presentations

168.176.61.22/comp/buzones/PROCEEDINGS/ICVSS2011

Jiri Matas - Tracking, Learning, Detection, Modeling

Ivan Laptev - Human Action Recognition

Josef Sivic - Large Scale Visual Search

Andrew Fitzgibbon - Computer Vision: Truth and Beauty(Kinect)

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

Page 174: ICVSS2011 Selected Presentations

The end...

Thanks !

Angel Cruz-Roa [email protected] Rueda-Olarte [email protected]

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations