
ICVSS2011 Selected Presentations


DESCRIPTION

This presentation shares experiences and selected talks from the International Computer Vision Summer School (ICVSS 2011), attended by Angel Cruz and Andrea Rueda from the BioIngenium Research Group, Universidad Nacional de Colombia.


Page 1: ICVSS2011 Selected Presentations


ICVSS 2011: Selected Presentations

Angel Cruz and Andrea Rueda

BioIngenium Research Group, Universidad Nacional de Colombia

August 25, 2011


Page 2: ICVSS2011 Selected Presentations


Outline

1 ICVSS 2011

2 A Trillion Photos - Steven Seitz

3 Efficient Novel Class Recognition and Search - Lorenzo Torresani

4 The Life of Structured Learned Dictionaries - Guillermo Sapiro

5 Image Rearrangement & Video Synopsis - Shmuel Peleg



Page 4: ICVSS2011 Selected Presentations


ICVSS 2011: International Computer Vision Summer School

15 speakers, from the USA, France, the UK, Italy, the Czech Republic (Prague), and Israel


Page 5: ICVSS2011 Selected Presentations


ICVSS 2011: International Computer Vision Summer School


Page 6: ICVSS2011 Selected Presentations


ICVSS 2011: International Computer Vision Summer School



Page 8: ICVSS2011 Selected Presentations

A Trillion Photos

Steve Seitz, University of Washington

Google

Sicily Computer Vision Summer School, July 11, 2011

Page 9: ICVSS2011 Selected Presentations

Facebook: >3 billion photos uploaded each month

~1 trillion photos taken each year

Page 10: ICVSS2011 Selected Presentations

What do you do with a trillion photos?

Digital Shoebox (hard drives, iPhoto, Facebook, ...)

Page 11: ICVSS2011 Selected Presentations
Page 12: ICVSS2011 Selected Presentations
Page 13: ICVSS2011 Selected Presentations
Page 14: ICVSS2011 Selected Presentations
Page 15: ICVSS2011 Selected Presentations
Page 16: ICVSS2011 Selected Presentations
Page 17: ICVSS2011 Selected Presentations

?

Page 18: ICVSS2011 Selected Presentations

Comparing images

Detect features using SIFT [Lowe, IJCV 2004]

Page 19: ICVSS2011 Selected Presentations

Comparing images

Extraordinarily robust image matching (see the sketch after this list)

– Across viewpoint (~60 degree out-of-plane rotations)

– Varying illumination

– Real-time implementations
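
A minimal sketch of this detect-and-match pipeline with OpenCV (assuming opencv-python 4.4+ and Lowe's ratio test; the image paths are placeholders, not files from the talk):

```python
# Minimal SIFT detect-and-match sketch (illustration only).
import cv2

img1 = cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical paths
img2 = cv2.imread("photo_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + 128-d descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to keep distinctive matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.75 * n.distance]

print(f"{len(kp1)} and {len(kp2)} keypoints, {len(good)} ratio-test matches")
```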

Page 20: ICVSS2011 Selected Presentations

Edges

Page 21: ICVSS2011 Selected Presentations

Scale Invariant Feature Transform

Adapted from slide by David Lowe

angle histogram over gradient orientations (0 to 2π)

Page 22: ICVSS2011 Selected Presentations

NASA Mars Rover images

Page 23: ICVSS2011 Selected Presentations

NASA Mars Rover images with SIFT feature matches (figure by Noah Snavely)

Page 24: ICVSS2011 Selected Presentations
Page 25: ICVSS2011 Selected Presentations

St. Peter's (inside)

Trevi Fountain

St. Peter's (outside)

Il Vittoriano

Colosseum (inside)

Colosseum (outside)

Forum

Page 26: ICVSS2011 Selected Presentations

Structure from motion

Matched photos → 3D structure

Page 27: ICVSS2011 Selected Presentations

Structure from motion

Cameras 1–3, with poses (R1, t1), (R2, t2), (R3, t3), observe 3D points p1, ..., p7.

minimize f(R, T, P) over all camera poses and 3D points

aka “bundle adjustment” (texts: Zisserman; Faugeras)
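
To make the objective concrete, a toy version of this joint minimization with SciPy: camera rotations, translations, and 3D points are refined together by minimizing reprojection error, assuming a simple pinhole model with known focal length and synthetic observations. This is a sketch of the idea, not the system shown in the talk:

```python
# Toy bundle adjustment: "minimize f(R, T, P)" as reprojection error.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points, rvecs, tvecs, focal=800.0):
    """Pinhole projection of each observed 3D point into its camera."""
    rotated = Rotation.from_rotvec(rvecs).apply(points) + tvecs
    return focal * rotated[:, :2] / rotated[:, 2:3]

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, observations):
    rvecs = params[: n_cams * 3].reshape(n_cams, 3)
    tvecs = params[n_cams * 3 : n_cams * 6].reshape(n_cams, 3)
    points = params[n_cams * 6 :].reshape(n_pts, 3)
    proj = project(points[pt_idx], rvecs[cam_idx], tvecs[cam_idx])
    return (proj - observations).ravel()

# Synthetic scene: 3 cameras, 7 points, every camera sees every point.
rng = np.random.default_rng(0)
n_cams, n_pts = 3, 7
true_pts = rng.normal(size=(n_pts, 3)) + [0, 0, 10]
true_r = 0.05 * rng.normal(size=(n_cams, 3))
true_t = rng.normal(size=(n_cams, 3)) * [1, 1, 0]
cam_idx = np.repeat(np.arange(n_cams), n_pts)
pt_idx = np.tile(np.arange(n_pts), n_cams)
obs = project(true_pts[pt_idx], true_r[cam_idx], true_t[cam_idx])

# Start from a perturbed guess and refine all poses and points jointly.
x0 = np.concatenate([true_r.ravel(), true_t.ravel(), true_pts.ravel()])
x0 = x0 + 0.01 * rng.normal(size=x0.shape)
result = least_squares(residuals, x0,
                       args=(n_cams, n_pts, cam_idx, pt_idx, obs))
print("final reprojection RMSE:", np.sqrt(np.mean(result.fun ** 2)))
```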

Page 28: ICVSS2011 Selected Presentations

?

Page 29: ICVSS2011 Selected Presentations

Reconstructing Rome in a day...

From ~1M images, using ~1000 cores

Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz
http://grail.cs.washington.edu/rome

Page 30: ICVSS2011 Selected Presentations

Rome 150K: Colosseum

Page 31: ICVSS2011 Selected Presentations

Rome: St. Peters

Page 32: ICVSS2011 Selected Presentations

Venice (250K images)

Page 33: ICVSS2011 Selected Presentations

Venice: Canal

Page 34: ICVSS2011 Selected Presentations

Dubrovnik

Page 35: ICVSS2011 Selected Presentations

Sparse output from the SfM system

From Sparse to Dense

Page 36: ICVSS2011 Selected Presentations

From Sparse to Dense

Furukawa, Curless, Seitz, Szeliski, CVPR 2010

Page 37: ICVSS2011 Selected Presentations
Page 38: ICVSS2011 Selected Presentations
Page 39: ICVSS2011 Selected Presentations
Page 40: ICVSS2011 Selected Presentations
Page 41: ICVSS2011 Selected Presentations

Most of our photos don’t look like this

Page 42: ICVSS2011 Selected Presentations
Page 43: ICVSS2011 Selected Presentations

recognition + alignment

Page 44: ICVSS2011 Selected Presentations

Your Life in 30 Seconds

path optimization

Page 45: ICVSS2011 Selected Presentations

Picasa Integration

• As "Face Movies" feature in v3.8

– Rahul Garg, Ira Kemelmacher

Page 46: ICVSS2011 Selected Presentations

Conclusion

trillions of photos + computer vision breakthroughs

= new ways to see the world


Page 48: ICVSS2011 Selected Presentations

Efficient Novel-Class Recognition and Search

Lorenzo Torresani

Page 49: ICVSS2011 Selected Presentations

Problem statement: novel object-class search

• Given: user-provided images of an object class + an image database (e.g., 1 million photos)
  - no text/tags available
  - query images may represent a novel class

• Want: the database images of this class

Page 50: ICVSS2011 Selected Presentations

Application: Web-powered visual search in unlabeled personal photos

Goal: find "soccer camp" pictures on my computer

1. Search the Web for images of "soccer camp"

2. Find images of this visual class on my computer

Page 51: ICVSS2011 Selected Presentations

Application: product search

• Search for aesthetic products

Page 52: ICVSS2011 Selected Presentations

Relation to other tasks

image retrieval | object categorization | novel class search

analogies:
- large databases
- efficient indexing
- compact representation

differences:
- simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)

Example figures: queries with their retrieved images, from [Nister and Stewenius, '07]; retrievals with a 32-bit RBM code vs. the 16384-bit Gist descriptor, with LabelMe-based label voting, from [Torralba et al., '08]; object retrieval on the Oxford 5K dataset (All Souls, Bridge of Sighs, Ashmolean Museum, Bodleian window), from [Philbin et al., '07].

Page 53: ICVSS2011 Selected Presentations

Relation to other tasks

image retrieval | object classification | novel class search

analogies (vs. image retrieval):
- large databases
- efficient indexing
- compact representation

differences:
- simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)

analogies (vs. object classification):
- recognition of object classes from a few examples

differences:
- classes to recognize are defined a priori
- training and recognition time is unimportant
- storage of features is not an issue

Page 54: ICVSS2011 Selected Presentations

Technical requirements of novel-class search

• The object classifier must be learned on the fly from few examples

• Recognition in the database must have low computational cost

• Image descriptors must be compact to allow storage in memory

Page 55: ICVSS2011 Selected Presentations

State-of-the-art in object classification

Winning recipe: many features + non-linear classifiers (e.g., [Gehler and Nowozin, CVPR'09])

!"#$%

&'()*+),%%-'.,()*+/%

#"0$%

...

!"#$%&#'()*+&,-)&.&#(#/*01#-2"#*

non-linear decision boundary

Page 56: ICVSS2011 Selected Presentations

Model evaluation on Caltech256

Plot: accuracy (%) vs. number of training examples for individual features (gist, phog, phog2pi, ssim, bow5000) with linear models.

Page 57: ICVSS2011 Selected Presentations

Model evaluation on Caltech256

Plot: accuracy (%) vs. number of training examples for individual features (gist, phog, phog2pi, ssim, bow5000) with linear models, plus their linear combination.

Page 58: ICVSS2011 Selected Presentations

!"#$%&'()*$+',

'"#*"-"*.%+'/$%0.&$1

!"#$%&'()*$+',/$%0.&$'2)(3"#%4)#

5)#6+"#$%&'()*$+',/$%0.&$'2)(3"#%4)#'7%898%8':.+4;+$'<$&#$+'

!$%&#"#=>'?@$A+$&'B'5)C)D"#E'FGH

Model evaluation on Caltech256

0 5 10 15 20 25 300

5

10

15

20

25

30

35

40

45

number of training examples

accu

racy

(%)

gistphogphog2pissimbow5000linear combinationnonlinear combination

Page 59: ICVSS2011 Selected Presentations

Multiple kernel combiners

Classification output is obtained by combining many features via non-linear kernels:

$$h(x) = \sum_{f=1}^{F} \beta_f \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b$$

(outer sum over features; inner sum over training examples)

!"#$%

&'()*+),%%-'.,()*+/%

#"0$%

...where

Page 60: ICVSS2011 Selected Presentations

Multiple kernel learning (MKL)

Learning a non-linear SVM by jointly optimizing over:

1. a linear combination of kernels: $k^*(x, x') = \sum_{f=1}^{F} \beta_f\, k_f(x, x')$

2. the SVM parameters: $\alpha \in \mathbb{R}^N$ and $b \in \mathbb{R}$

[Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]

$$\min_{\alpha, \beta, b}\;\; \frac{1}{2} \sum_{f=1}^{F} \beta_f\, \alpha^T K_f \alpha \;+\; C \sum_{n=1}^{N} L\!\left(y_n,\; b + \sum_{f=1}^{F} \beta_f\, K_f(x_n)^T \alpha\right)$$

$$\text{subject to}\;\; \sum_{f=1}^{F} \beta_f = 1, \quad \beta_f \ge 0, \; f = 1, \dots, F$$

where $L(y, t) = \max(0, 1 - y\,t)$ is the hinge loss and $K_f(x) = [k_f(x, x_1), k_f(x, x_2), \dots, k_f(x, x_N)]^T$.

Background (excerpt shown from [Gehler and Nowozin, CVPR'09]): each image feature $f_m$ is paired with a kernel $k_m(x, x') = k(f_m(x), f_m(x'))$; simple baselines combine the $F$ kernels by averaging, $k^*(x, x') = \frac{1}{F}\sum_m k_m(x, x')$, or by their geometric mean, $k^*(x, x') = (\prod_m k_m(x, x'))^{1/F}$, and then train a standard SVM, while MKL learns a sparse kernel combination ($\beta_m \ge 0$, $\sum_m \beta_m = 1$) jointly with the SVM parameters, solved e.g. with SILP or SimpleMKL.

Page 61: ICVSS2011 Selected Presentations

LP-β: a two-stage approach to MKL [Gehler and Nowozin, 2009]

• Classification output of traditional MKL:

$$h_{\mathrm{MKL}}(x) = \sum_{f=1}^{F} \beta_f \left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b \right)$$

• Classification function of LP-β:

$$h(x) = \sum_{f=1}^{F} \beta_f \underbrace{\left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \right)}_{h_f(x)}$$

Two-stage training procedure:

1. train each $h_f(x)$ independently → traditional SVM learning

2. optimize over $\beta$ → a simple linear program (a sketch of both stages follows)
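
A toy sketch of this two-stage procedure, under stated assumptions: synthetic data with two feature channels, scikit-learn RBF-kernel SVMs for stage 1, and a simplified hinge-loss linear program over β for stage 2 standing in for the exact LP of Gehler and Nowozin:

```python
# Sketch of LP-beta's two-stage training (simplified; not the authors' code).
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: two "feature channels" (e.g. a color and a shape descriptor).
n = 400
y = rng.choice([-1, 1], size=n)
feats = [y[:, None] * 1.0 + rng.normal(scale=2.0, size=(n, 5)),   # weak channel
         y[:, None] * 1.5 + rng.normal(scale=1.0, size=(n, 10))]  # stronger channel
idx_tr, idx_val = train_test_split(np.arange(n), test_size=0.5, random_state=0)

# Stage 1: train each per-feature SVM h_f(x) independently.
svms = [SVC(kernel="rbf", gamma="scale").fit(X[idx_tr], y[idx_tr]) for X in feats]
H = np.column_stack([clf.decision_function(X[idx_val]) for clf, X in zip(svms, feats)])
y_val, N, F = y[idx_val], H.shape[0], H.shape[1]

# Stage 2: linear program over beta (on the simplex) and slacks xi:
#   min sum(xi)/N  s.t.  xi_n >= 1 - y_n * sum_f beta_f H[n, f],  xi_n >= 0,
#                        sum_f beta_f = 1,  beta_f >= 0.
c = np.concatenate([np.zeros(F), np.ones(N) / N])
A_ub = np.hstack([-y_val[:, None] * H, -np.eye(N)])
b_ub = -np.ones(N)
A_eq = np.concatenate([np.ones(F), np.zeros(N)])[None, :]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * (F + N), method="highs")
beta = res.x[:F]
print("learned beta:", np.round(beta, 3))
print("validation accuracy:", np.mean(np.sign(H @ beta) == y_val))
```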

Page 62: ICVSS2011 Selected Presentations

LP-β for novel-class search?

The LP-β classifier:

$$h(x) = \sum_{f=1}^{F} \beta_f \left( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \right)$$

(outer sum over features; inner sum over training examples)

Unsuitable for our needs due to:

• large storage requirements (typically over 20K bytes/image)

• costly evaluation (requires query-time kernel distance computation for each test image)

• costly training (1+ minute for O(10) training examples)

Page 63: ICVSS2011 Selected Presentations

Classemes: a compact descriptor for efficient recognition [Torresani et al., 2010]

Key idea: represent each image in terms of its "closeness" to a set of basis classes ("classemes").

The c-th entry of the descriptor is the output of a pre-learned LP-β classifier for the c-th basis class (trained before the creation of the database):

$$\phi_c(x) = h^{\mathrm{classeme}}_c(x) = \sum_{f=1}^{F} \beta^c_f \left( \sum_{n=1}^{N} k_f(x, x^c_n)\,\alpha^c_n + b^c \right)$$

$$\Phi(x) = [\phi_1(x), \dots, \phi_C(x)]^T$$

Query-time learning: given training examples $\Phi(x_1), \dots, \Phi(x_N)$ of the novel class, train a linear classifier on $\Phi(x)$, e.g.

$$g_{\mathrm{duck}}(\Phi(x); w^{\mathrm{duck}}) = \Phi(x)^T w^{\mathrm{duck}} = \sum_{c=1}^{C} w^{\mathrm{duck}}_c\, \phi_c(x)$$

Only the linear weights $w^{\mathrm{duck}}$ are trained at query time; the classeme classifiers $\phi_c$ are trained before the creation of the database.
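
A sketch of this query-time search, with stand-ins named explicitly: a fixed random linear map plays the role of the pre-learned classeme classifiers (each φ_c is really an offline-trained LP-β model), the database descriptors are synthetic, and scikit-learn's LinearSVC is used as the query-time linear learner:

```python
# Query-time novel-class search on classeme vectors (illustrative sketch).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
C, D = 2659, 512            # number of classemes, dim of raw features x

W_classemes = rng.normal(size=(D, C)) / np.sqrt(D)   # stand-in for phi_1..phi_C
def classemes(X):
    """Phi(x): per-image vector of C basis-class scores."""
    return X @ W_classemes

database_x = rng.normal(size=(5_000, D))             # raw features of database
database_phi = classemes(database_x)                 # computed once, stored

# At query time: a handful of positive examples of the novel class plus
# some random database images as negatives, then a linear SVM on Phi(x).
query_x = rng.normal(size=(10, D)) + 0.5
negatives = database_phi[rng.choice(len(database_phi), 200, replace=False)]
X_train = np.vstack([classemes(query_x), negatives])
y_train = np.concatenate([np.ones(10), -np.ones(200)])

clf = LinearSVC(C=1.0).fit(X_train, y_train)
scores = database_phi @ clf.coef_.ravel() + clf.intercept_   # one dot product/image
top = np.argsort(-scores)[:25]
print("top-25 database indices:", top)
```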

Page 64: ICVSS2011 Selected Presentations

How this works...

• Accurate semantic labels are not required...

• Classeme classifiers are just used as detectors for specific patterns of texture, color, shape, etc.

Table 1 of [Torresani et al., 2010]: highly weighted classemes. Five classemes with the highest LP-β weights for the retrieval experiment, for a selection of Caltech 256 categories. Some may appear to make semantic sense, but the goal is simply to create a useful feature vector, not to assign semantic labels; the somewhat peculiar classeme labels reflect the ontology used as a source of base categories.

Page 65: ICVSS2011 Selected Presentations

Related work

• Attribute-based recognition:

Figures from [Farhadi et al., CVPR'09], "Describing Objects by their Attributes": attribute prediction within and across categories (a-Pascal / a-Yahoo), including reporting the absence of typical attributes (e.g., an aeroplane with no visible wing) and the presence of atypical ones (e.g., "skin" on a dining table).

Title and abstract of [Lampert et al., CVPR'09], "Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer": object classes with no training images are detected from a human-specified high-level attribute description (e.g., otter, polar bear, and zebra described by attributes such as black, white, brown, stripes, water, eats fish), evaluated on the "Animals with Attributes" dataset of over 30,000 animal images, 50 classes, and 85 semantic attributes.

requires hand-specified attribute-class associations

attribute classifiers must be trained with human-labeled examples

Page 66: ICVSS2011 Selected Presentations

Method overview

1. Classeme learning: train one classifier per basis class ("classeme"), e.g. φ_"body of water"(x), φ_"walking"(x), ..., and stack their outputs into the descriptor Φ(x) = [φ_1(x), ..., φ_C(x)].

2. Using the classemes for recognition and retrieval: given training examples of a novel class, represented by Φ(x_1), ..., Φ(x_N), learn a linear classifier on top of the classemes, e.g.

    g_duck(Φ(x)) = Σ_{c=1}^{C} w_c^duck φ_c(x)
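A minimal sketch of this two-stage idea (not the authors' code): the classeme classifiers are stand-ins (random linear scorers), the classeme descriptor Φ(x) stacks their outputs, and the novel class is learned with an off-the-shelf linear SVM. All names and sizes below are illustrative assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    C, D_feat = 2659, 512                          # number of classemes, raw feature size (illustrative)
    W_classemes = rng.normal(size=(C, D_feat))     # stand-in for the pre-trained classeme classifiers

    def classeme_descriptor(x):
        """Phi(x): the C classeme classifier outputs phi_c(x) for raw features x."""
        return W_classemes @ x

    # training examples of the novel class ("duck") and negatives,
    # represented only through their classeme descriptors Phi(x_1) ... Phi(x_N)
    X_raw = rng.normal(size=(40, D_feat))
    y = np.array([1] * 20 + [0] * 20)
    Phi = np.stack([classeme_descriptor(x) for x in X_raw])

    clf = LinearSVC(C=1.0).fit(Phi, y)             # g_duck(Phi(x)) = w_duck . Phi(x)
    score = clf.decision_function(classeme_descriptor(rng.normal(size=D_feat))[None, :])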

Page 67: ICVSS2011 Selected Presentations

Classeme learning: choosing the basis classes

• Classeme labels desiderata:

- must be visual concepts

- should span the entire space of visual classes

• Our selection: concepts defined in the Large Scale Ontology for Multimedia [LSCOM] to be “useful, observable and feasible for automatic detection”.

2659 classeme labels, after manual elimination of plurals, near-duplicates, and inappropriate concepts

Page 68: ICVSS2011 Selected Presentations

Classeme learning: gathering the training data

• We downloaded the top 150 images returned by Bing Images for each classeme label

• For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier

yes no

φ”walking”(x)

Page 69: ICVSS2011 Selected Presentations

Classeme learning: training the classifiers

• Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009]:

• We use 13 kernels based on spatial pyramid histograms computed from the following features:
  - color GIST [Oliva and Torralba, 2001]
  - oriented gradients [Dalal and Triggs, 2005]
  - self-similarity descriptors [Shechtman and Irani, 2007]
  - SIFT [Lowe, 2004]

    φ(x) = Σ_{f=1}^{F} β_f ( Σ_{n=1}^{N} k_f(x, x_n) α_{f,n} + b_f )

i.e. a linear combination of feature-specific SVMs.
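A small sketch of how such a combiner is evaluated, assuming RBF kernels as the per-feature kernels and illustrative parameter values (the actual kernels and learned parameters of the talk are not reproduced here):

    import numpy as np

    def rbf(x, Xn, gamma):
        # k_f(x, x_n) for all training points x_n of one feature channel
        return np.exp(-gamma * np.sum((Xn - x) ** 2, axis=1))

    def lp_beta_score(x_feats, train_feats, alphas, biases, betas, gammas):
        """phi(x) = sum_f beta_f * (sum_n k_f(x, x_n) alpha_{f,n} + b_f)."""
        score = 0.0
        for f, beta in enumerate(betas):
            k = rbf(x_feats[f], train_feats[f], gammas[f])
            score += beta * (k @ alphas[f] + biases[f])
        return score

    # toy usage with F=2 feature channels and N=5 training images
    rng = np.random.default_rng(0)
    F, N = 2, 5
    train_feats = [rng.normal(size=(N, 8)) for _ in range(F)]
    alphas = [rng.normal(size=N) for _ in range(F)]
    biases, betas, gammas = [0.1, -0.2], [0.6, 0.4], [0.5, 0.5]
    x_feats = [rng.normal(size=8) for _ in range(F)]
    print(lp_beta_score(x_feats, train_feats, alphas, biases, betas, gammas))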

Page 70: ICVSS2011 Selected Presentations

A dimensionality reduction view of classemes

x = [ GIST | oriented gradients | self-similarity descriptor | SIFT ]   →   Φ(x) = [ φ_1(x), ..., φ_2659(x) ]

• Raw features x: 23K bytes/image; non-linear kernels are needed for good classification.
• Classemes Φ(x): near state-of-the-art accuracy with linear classifiers; can be quantized down to <200 bytes/image with almost no recognition loss.

Page 71: ICVSS2011 Selected Presentations

Experiment 1: multiclass recognition on Caltech256

[Plot: accuracy (%) vs. number of training examples on Caltech256. Curves: LPbeta (LP-β in [Gehler & Nowozin, 2009] using 39 kernels), LPbeta13 (LP-β with our raw features x), MKL, Csvm (our approach: linear SVM with classemes Φ(x)), Cq1svm (linear SVM with binarized classemes, i.e. Φ(x) > 0), Xsvm (linear SVM with x).]

Page 72: ICVSS2011 Selected Presentations

Computational cost comparison

[Bar charts: training time in minutes and testing time in milliseconds for LPbeta vs. Csvm. Training LPbeta takes about 23 hours, training Csvm about 9 minutes.]

Page 73: ICVSS2011 Selected Presentations

Accuracy vs. compactness

[Plot: compactness (images per MB, log scale) vs. accuracy (%). Methods: LPbeta13, Csvm, Cq1svm, Xsvm, nbnn [Boiman et al., 2008], emk [Bo and Sminchisescu, 2008]. Lines link performance at 15 and 30 training examples; annotated storage costs are 188 bytes/image, 2.5K bytes/image, 23K bytes/image and 128K bytes/image.]

Page 74: ICVSS2011 Selected Presentations

Experiment 2: object class retrieval

[Fig. 4 from "Efficient Object Category Recognition Using Classemes": Retrieval. Precision@25 vs. number of training images for Csvm, Cq1Rocchio (α=1, β=0), Cq1Rocchio (α=0.75, β=0.15), Bowsvm, BowRocchio (α=1, β=0) and BowRocchio (α=0.75, β=0.15). Percentage of the top 25 in a 6400-document set which match the query class; random performance is 0.4%.]

We consider two different retrieval methods. The first method is a linear SVM learned for each of the Caltech classes using the one-vs-all strategy. We compare these classifiers to the Rocchio algorithm [15], a classic information retrieval technique for implementing relevance feedback. In order to use this method we represent each image as a document vector d(x). In the case of the BOW model, d(x) is the traditional tf-idf-weighted histogram of words. In the case of classemes instead, we define d(x)_i = [φ_i(x) > 0] · idf_i, i.e. d(x) is computed by multiplying the binarized classemes by their inverted document frequencies. Given a set of relevant training images D_r and a set of non-relevant examples D_nr, Rocchio's algorithm computes the document query

    q = α · (1/|D_r|) Σ_{x_r ∈ D_r} d(x_r)  −  β · (1/|D_nr|) Σ_{x_nr ∈ D_nr} d(x_nr)     (1)

where α and β are scalar values. The algorithm then retrieves the database documents having highest cosine similarity with this query. In our experiment, we set D_r to be the training examples of the class to retrieve, and D_nr to be the remaining training images. We report results for two different settings: (α, β) = (0.75, 0.15), and (α, β) = (1, 0), corresponding to the case where only positive feedback is used.

Figure 4 shows that methods using classemes consistently outperform the algorithms based on traditional BOW features. Furthermore, SVM yields much better precision than Rocchio's algorithm when using classemes. Note that these linear classifiers can be evaluated very efficiently even on large data sets; furthermore, they can also be trained efficiently and thus used in applications requiring fast query-time learning: for example, the average time required to learn a one-vs-all SVM using classemes is 674 ms when using 5 training examples from each Caltech class.

• Random performance is 0.4%
• Training Csvm takes 0.6 sec with 5*256 training examples
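A minimal sketch of the Rocchio relevance-feedback query described above, with binarized classemes weighted by inverse document frequency and cosine-similarity ranking. The data, the idf weighting details and the split into relevant/non-relevant examples are illustrative assumptions, not the experiment's actual setup.

    import numpy as np

    def rocchio_query(D_rel, D_nonrel, alpha=0.75, beta=0.15):
        """q = alpha * mean of relevant document vectors - beta * mean of non-relevant ones."""
        return alpha * D_rel.mean(axis=0) - beta * D_nonrel.mean(axis=0)

    def rank_by_cosine(q, docs):
        sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
        return np.argsort(-sims)

    # toy usage: d(x)_i = [Phi_i(x) > 0] * idf_i on stand-in classeme vectors
    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(100, 2659)) > 0.5            # stand-in binarized classemes
    idf = np.log(len(Phi) / (1 + Phi.sum(axis=0)))
    docs = Phi * idf
    q = rocchio_query(docs[:5], docs[5:10])             # 5 relevant, 5 non-relevant examples
    top25 = rank_by_cosine(q, docs)[:25]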

Page 75: ICVSS2011 Selected Presentations

Analogies with text retrieval

• Classeme representation of an image: presence/absence of visual attributes
• Bag-of-words representation of a text document: presence/absence of words

Page 76: ICVSS2011 Selected Presentations

Related work

• Prior work (e.g., [Sivic & Zisserman, 2003; Nister & Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words: detect interest patches, compute SIFT descriptors [Lowe, 2004], quantize the descriptors against a codebook, and represent the image as a sparse histogram of visual-word frequencies.

• To extend this methodology to object-class retrieval we need:
  - a representation more suited to object class recognition (e.g. classemes as opposed to bags of visual words)
  - to train the ranking/retrieval function for every new query class

Page 77: ICVSS2011 Selected Presentations

Data structures for efficient retrieval

Incidence matrix (documents × features f0 ... f7):

  I0: 1 0 1 0 0 1 0 0
  I1: 0 0 1 0 1 0 0 0
  I2: 1 1 0 1 0 0 0 0
  I3: 1 0 1 1 0 0 0 0
  I4: 1 0 0 0 1 0 1 0
  I5: 0 0 0 0 1 0 1 0
  I6: 1 0 0 0 0 1 0 1
  I7: 0 1 0 0 1 0 0 0
  I8: 1 1 0 0 0 1 0 0
  I9: 0 0 0 1 1 1 0 1

Inverted index (posting list per feature):

  f0: I0, I2, I3, I4, I6, I8
  f1: I2, I7, I8
  f2: I0, I1, I3
  f3: I2, I3, I9
  f4: I1, I4, I5, I7, I9
  f5: I0, I6, I8, I9
  f6: I4, I5
  f7: I6, I9

• very compact: only one bit per feature entry
• enables efficient calculation of w^T Φ for all Φ, as Σ_{i : Φ_i ≠ 0} w_i Φ_i
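A minimal sketch of this scoring scheme, using the incidence matrix and weight vector of these slides; the helper names are illustrative, not the paper's implementation.

    import numpy as np

    def build_inverted_index(Phi_bin):
        """Posting list per feature: the images whose binary classeme is 1."""
        return [np.flatnonzero(Phi_bin[:, i]) for i in range(Phi_bin.shape[1])]

    def score_with_inverted_index(w, index, n_images):
        scores = np.zeros(n_images)
        for i, wi in enumerate(w):
            if wi != 0:                       # only non-zero weights touch the index
                scores[index[i]] += wi        # add w_i to every image on feature i's list
        return scores

    # usage with the 10x8 incidence matrix of the slides
    Phi_bin = np.array([[1,0,1,0,0,1,0,0],[0,0,1,0,1,0,0,0],[1,1,0,1,0,0,0,0],
                        [1,0,1,1,0,0,0,0],[1,0,0,0,1,0,1,0],[0,0,0,0,1,0,1,0],
                        [1,0,0,0,0,1,0,1],[0,1,0,0,1,0,0,0],[1,1,0,0,0,1,0,0],
                        [0,0,0,1,1,1,0,1]])
    w = np.array([1.5, -2, 0, -5, 0, 3, -2, 0])
    index = build_inverted_index(Phi_bin)
    print(score_with_inverted_index(w, index, len(Phi_bin)))
    print(Phi_bin @ w)                        # same scores via dense inner products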

Page 78: ICVSS2011 Selected Presentations

Efficient retrieval via inverted index

Inverted index (as above):

  f0: I0, I2, I3, I4, I6, I8   f1: I2, I7, I8   f2: I0, I1, I3   f3: I2, I3, I9
  f4: I1, I4, I5, I7, I9   f5: I0, I6, I8, I9   f6: I4, I5   f7: I6, I9

w: [1.5  -2  0  -5  0  3  -2  0]

Goal: compute the score w^T Φ for every binary vector Φ in the database.

Page 79: ICVSS2011 Selected Presentations

Efficient retrieval via inverted index

Inverted index and w: [1.5  -2  0  -5  0  3  -2  0] (as above).

Scoring: for each feature f_i with w_i ≠ 0, add w_i to the running score of every image on f_i's posting list; the scores of I0 ... I9 are accumulated incrementally.

Page 80: ICVSS2011 Selected Presentations


Efficient retrieval via inverted index

Inverted index and w: [1.5  -2  0  -5  0  3  -2  0] (as above).

Cost of scoring is linear in the sum of the lengths of the inverted lists associated with non-zero weights.

Page 85: ICVSS2011 Selected Presentations

Improve efficiency via sparse weight vectors

Key-idea: force w to contain as many zeros as possible

Learning objective:

    E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n)

where R(w) is the regularizer, L is the loss function, Φ_n is the classeme vector of example n and y_n its label.

• L2-SVM: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 − y_n w^T Φ_n)

• Since |w_i| > w_i^2 for small w_i and |w_i| < w_i^2 for large w_i, choosing R(w) = Σ_i |w_i| will tend to produce a small number of larger weights and more zero weights.

[Figure (from a paper on ℓ1 wavelet-penalized tomographic inversion): because the ℓ1-ball has no bulge, the solution with smallest ℓ1-norm is sparser than the solution with smallest ℓ2-norm; a |w_i| penalty affects small coefficients more, and large coefficients less, than the traditional w_i^2 penalty.]

Page 86: ICVSS2011 Selected Presentations

Improve efficiency via sparse weight vectors

Key-idea: force w to contain as many zeros as possible

Learning objective (as above):

    E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n)

• L2-SVM: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 − y_n w^T Φ_n)

• L1-LR: R(w) = Σ_i |w_i|,  L(w; Φ_n, y_n) = log(1 + exp(−y_n w^T Φ_n))

• FGM (Feature Generating Machine) [Tan et al., 2010]: R(w) = w^T w,  L(w; Φ_n, y_n) = max(0, 1 − y_n (w ⊙ d)^T Φ_n),  s.t. 1^T d ≤ B, d ∈ {0, 1}^D, where ⊙ is the elementwise product.
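A hedged sketch of the sparsity effect of the regularizer, comparing an L2-SVM and an L1-regularized logistic regression on the same stand-in data with scikit-learn (not the experimental setup of the talk):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 500))                  # stand-in classeme vectors
    y = (X[:, :5].sum(axis=1) > 0).astype(int)       # labels depend on a few dimensions

    l2_svm = LinearSVC(penalty="l2", C=1.0).fit(X, y)
    l1_lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

    print("non-zero weights, L2-SVM:", np.count_nonzero(l2_svm.coef_))
    print("non-zero weights, L1-LR :", np.count_nonzero(l1_lr.coef_))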

Page 87: ICVSS2011 Selected Presentations

Performance evaluation on ImageNet (10M images)

[Plot: Precision@10 (%) vs. search time per query (seconds). Curves: full inner product evaluation L2-SVM, full inner product evaluation L1-LR, inverted index L2-SVM, inverted index L1-LR.]

• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%

Each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n).

[Rastegari et al., 2011]

Page 88: ICVSS2011 Selected Presentations

Top-k ranking

• Do we need to rank the entire database? - users only care about the top-ranked images

• Key idea: - for each image iteratively update an upper-bound and a lower-bound on the score

- gradually prune images that cannot rank in the top-k

Page 89: ICVSS2011 Selected Presentations

Top-k pruning

w: [ 3 -2 0 -6 0 3 -2 0 ]

[Rastegari et al., 2011]

Incidence matrix as on the previous slides (images I0 ... I9 over features f0 ... f7).

• Highest possible score → initial upper bound: u* = w^T Φ^U for the binary vector Φ^U with Φ^U_i = 1 iff w_i > 0 (6 in this case).

• Lowest possible score → initial lower bound: l* = w^T Φ^L for the binary vector Φ^L with Φ^L_i = 1 iff w_i < 0 (−10 in this case).

Page 90: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w: [3  -2  0  -6  0  3  -2  0] as above.

• Initialization: for all images, upper bound u* and lower bound l*.

[Bar chart: per-image upper and lower bounds for I0 ... I9, all equal at initialization.]

Page 91: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w as above.

• Load feature i with w_i = +3 (> 0). For each image n:
  - subtract 3 from the upper bound if φ_{n,i} = 0
  - add 3 to the lower bound if φ_{n,i} = 1

Page 92: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w as above.

• Load feature i with w_i = −2 (< 0). For each image n:
  - decrement the upper bound by 2 if φ_{n,i} = 1
  - increment the lower bound by 2 if φ_{n,i} = 0

Page 93: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

Incidence matrix and w as above.

• Load feature i with w_i = −6 (< 0). For each image n:
  - decrement the upper bound by 6 if φ_{n,i} = 1
  - increment the lower bound by 6 if φ_{n,i} = 0

Page 94: ICVSS2011 Selected Presentations

Top-k pruning ! [Rastegari et al., 2011]

• Suppose k = 4: we can prune I2 and I9, since they cannot rank in the top-k.

Incidence matrix and w: [3  -2  0  -6  0  3  -2  0] as above.

[Bar chart: upper and lower bounds for I0 ... I9 after the updates.]
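A minimal sketch of the bound-update pruning loop sketched on these slides, run on the slides' own incidence matrix and weights; this is an illustration, not the implementation of [Rastegari et al., 2011].

    import numpy as np

    def top_k_pruning(Phi_bin, w, k):
        n = len(Phi_bin)
        upper = np.full(n, w[w > 0].sum())     # best case: all positive-weight features present
        lower = np.full(n, w[w < 0].sum())     # worst case: all negative-weight features present
        alive = np.ones(n, dtype=bool)
        for i in np.argsort(-np.abs(w)):       # visit features by decreasing |w_i|
            if w[i] == 0:
                break
            has = Phi_bin[:, i].astype(bool)
            if w[i] > 0:
                lower[has] += w[i]             # confirmed positive contribution
                upper[~has] -= w[i]            # missed positive contribution
            else:
                upper[has] += w[i]             # confirmed negative contribution
                lower[~has] -= w[i]            # avoided negative contribution
            # prune images whose upper bound falls below the k-th best lower bound
            kth_lower = np.sort(lower[alive])[-min(k, alive.sum())]
            alive &= upper >= kth_lower
        return alive, lower                    # lower equals the exact scores at the end

    Phi_bin = np.array([[1,0,1,0,0,1,0,0],[0,0,1,0,1,0,0,0],[1,1,0,1,0,0,0,0],
                        [1,0,1,1,0,0,0,0],[1,0,0,0,1,0,1,0],[0,0,0,0,1,0,1,0],
                        [1,0,0,0,0,1,0,1],[0,1,0,0,1,0,0,0],[1,1,0,0,0,1,0,0],
                        [0,0,0,1,1,1,0,1]])
    w = np.array([3, -2, 0, -6, 0, 3, -2, 0])
    alive, scores = top_k_pruning(Phi_bin, w, k=4)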

Page 95: ICVSS2011 Selected Presentations

Distribution of weights and pruning rate

[Figure 2 from the ICCV 2011 submission #1745 (cited on these slides as [Rastegari et al., 2011]): (a) distribution of weight absolute values for different classifiers (L1-LR, L2-SVM, FGM), after sorting the weight magnitudes; TkP runs faster with sparse, highly skewed weight values. (b) Pruning rate of TkP for the various classification models and different values of k (k = 10, 3000). Slide axes: features considered in descending order of |w_i| (x) vs. normalized absolute weight values (y).]

Page 96: ICVSS2011 Selected Presentations

Performance evaluation on ImageNet (10M images)

[Plot: Precision@10 (%) vs. search time per query (seconds). Curves: TkP L1-LR, TkP L2-SVM, inverted index L1-LR, inverted index L2-SVM.]

• k = 10
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%

Each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_{n=1}^{N} L(w; Φ_n, y_n).

[Rastegari et al., 2011]

Page 97: ICVSS2011 Selected Presentations

Alternative search strategy: approximate ranking

• Key idea: approximate the score function with a measure that can be computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008]).

• Approximate ranking via vector quantization:

    w^T Φ ≈ w^T q(Φ)

  where q(·) is a quantizer returning the cluster centroid nearest to Φ.

• Problem:
  - to approximate the score well we need a fine quantization
  - the dimensionality of our space is D = 2659: too large to enable a fine quantization using k-means clustering

Page 98: ICVSS2011 Selected Presentations

Product quantization [Jegou et al., 2011]

• Split the feature vector Φ into v subvectors: Φ ≈ [ Φ_1 | Φ_2 | ... | Φ_v ]

• The subvectors are quantized separately, q(Φ) = [ q_1(Φ_1) | q_2(Φ_2) | ... | q_v(Φ_v) ], where each q_j(·) is learned by k-means in a space of dimensionality D/v with a limited number of centroids.

• Example from [Jegou et al., 2011]: a 128-dimensional vector is split into 8 subvectors of dimension 16; with 2^8 = 256 centroids per subvector (8 bits each), the whole vector is encoded by a 64-bit quantization index.
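A minimal sketch of product quantization with per-sub-block k-means (scikit-learn); the data, block count and centroid count are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_pq(X, v, r):
        """Split D-dim vectors into v sub-blocks and learn r centroids per block."""
        d = X.shape[1] // v
        return [KMeans(n_clusters=r, n_init=4, random_state=0).fit(X[:, j*d:(j+1)*d])
                for j in range(v)]

    def pq_encode(X, quantizers):
        """Code = one centroid index per sub-block (v small integers per vector)."""
        d = X.shape[1] // len(quantizers)
        return np.stack([q.predict(X[:, j*d:(j+1)*d])
                         for j, q in enumerate(quantizers)], axis=1)

    # toy usage: 128-dim vectors, v=8 sub-blocks, r=16 centroids per block
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 128))
    quantizers = train_pq(X, v=8, r=16)
    codes = pq_encode(X, quantizers)      # shape (1000, 8), each entry in [0, 16)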

Page 99: ICVSS2011 Selected Presentations

Efficient approximate scoring

    w^T Φ ≈ w^T q(Φ) = Σ_{j=1}^{v} w_j^T q_j(Φ_j)

1. Fill a look-up table: the inner products s_{jk} = w_j^T c_{jk} between each weight sub-block w_j (j = 1, ..., v) and each of the r centroids c_{jk} of sub-block j can be precomputed and stored in a v × r table.

2. Score each quantized vector q(Φ) in the database using the look-up table:

    w^T q(Φ) = w_1^T q_1(Φ_1) + w_2^T q_2(Φ_2) + ... + w_v^T q_v(Φ_v)

Only v additions per image!
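A minimal sketch of the two steps above; the codebooks and PQ codes here are random stand-ins (e.g. produced by the product-quantization sketch earlier), and all names and sizes are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    v, r, d = 8, 256, 16                          # sub-blocks, centroids per block, sub-block dim
    centroids = rng.normal(size=(v, r, d))        # stand-in PQ codebooks
    codes = rng.integers(0, r, size=(10000, v))   # PQ codes of the database images
    w = rng.normal(size=v * d)                    # linear classifier for the query class

    # 1. fill the look-up table: s[j, k] = w_j . c_{j,k}
    table = np.einsum('jkd,jd->jk', centroids, w.reshape(v, d))

    # 2. approximate score w^T q(Phi) for every image: v look-ups and additions each
    scores = sum(table[j, codes[:, j]] for j in range(v))
    top10 = np.argsort(-scores)[:10]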

Page 105: ICVSS2011 Selected Presentations

Choice of parameters! [Rastegari et al., 2011]

• Dimensionality is first reduced with PCA from D=2659 to D’ < D

• How do we choose D’, v (number of sub-blocks), r (number of centroids per sub-block)?

• Effect of parameter choices on a database of 150K images:

[Plot: Precision@10 (%) vs. search time per query (seconds) for D' ∈ {128, 256, 512} and various (v, r) settings, e.g. (16, 2^8), (32, 2^8), (64, 2^6), (128, 2^8), (256, 2^6), (256, 2^8).]

Page 106: ICVSS2011 Selected Presentations

[Figure 1 from the ICCV 2011 submission #1745 ([Rastegari et al., 2011]): class-retrieval precision versus search time for the ILSVRC2010 data set. The x-axis is search time; the y-axis shows the percentage of true positives ranked in the top 10 using a database of 150,000 images (149,850 distractors and 150 true positives per query class). Curves: AR L2-SVM, TkP L1-LR, TkP L2-SVM, TkP FGM; each curve is obtained by varying the parameters controlling the accuracy-speed tradeoff.]

Performance evaluation on 150K images

• Performance averaged over 1000 object classes used as queries

• 50 training examples per query class

• Database includes 150 images of the query class and 150K images of other classes

• Prec@10 of a random classifiers is 0.1%

approximate ranking

Page 107: ICVSS2011 Selected Presentations

Memory requirements for 10M images

[Bar chart of index memory usage: inverted index ≈ 9 GB, incidence matrix (used by TkP) ≈ 3 GB, product quantization index ≈ 1.8 GB.]

Page 108: ICVSS2011 Selected Presentations

Conclusions and open questions

• Compact descriptor enabling efficient novel-class recognition (less than 200 bytes/image, yet it produces performance similar to MKL at a tiny fraction of the cost)

• Questions currently under investigation:
  - can we learn better classemes from fully-labeled data?
  - can we decouple the descriptor size from the number of classeme classes?
  - can we encode spatial information ([Li et al., NIPS10])?

• Software for classeme extraction available at: http://vlg.cs.dartmouth.edu/projects/classemes_extractor/

Classemes:

Information retrieval approaches to large-scale object-class search:

• sparse representations and retrieval models

• top-k ranking

• approximate scoring

Page 109: ICVSS2011 Selected Presentations

ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg

Outline

1 ICVSS 2011

2 A Trillion Photos - Steven Seitz

3 Efficient Novel Class Recognition and Search - LorenzoTorresani

4 The Life of Structured Learned Dictionaries - Guillermo Sapiro

5 Image Rearrangement & Video Synopsis - Shmuel Peleg

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

Page 110: ICVSS2011 Selected Presentations

The Life of Structured Learned Dictionaries

Guillermo SapiroUniversity of Minnesota

G. Yu and S. Mallat (Inverse problems via GMM)G. Yu and F. Leger (Matrix completion)G. Yu (Statistical compressed sensing)

A. Castrodad (activity recognition in video)M. Zhou, D. Dunson, and L. Carin (video layers separation)

1

Friday, July 8, 2011

Page 111: ICVSS2011 Selected Presentations

Examples: zooming, inpainting, deblurring

Inverse problems: y = U f + w, with w ∼ N(0, σ² Id), where U is a masking (inpainting), subsampling (zooming) or convolution (deblurring) operator.

Friday, July 8, 2011

Page 112: ICVSS2011 Selected Presentations

Learned Overcomplete Dictionaries

• Dictionary learning:

    min_{D, {a_i}_{1≤i≤I}}  Σ_{1≤i≤I} ( ||f_i − D a_i||² + λ ||a_i||_1 )

• Better performance than pre-fixed dictionaries.
• Non-convex.
• High computational complexity.
• Huge numbers of parameters to estimate.
• Behavior not well understood (results starting to appear).
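A hedged sketch of this objective using scikit-learn's alternating minimization on stand-in patch data (not the presenter's code; the patch size, number of atoms and λ below are assumptions):

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    # min_{D, a_i}  sum_i ||f_i - D a_i||^2 + lambda ||a_i||_1, here on 8x8 patches
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(5000, 64))          # f_i: flattened 8x8 patches (stand-in data)

    learner = MiniBatchDictionaryLearning(n_components=256, alpha=1.0,
                                          batch_size=200, random_state=0)
    codes = learner.fit_transform(patches)         # sparse coefficients a_i
    D = learner.components_                        # learned dictionary atoms (rows)

    print("atoms:", D.shape, "avg non-zeros per patch:", np.count_nonzero(codes, axis=1).mean())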

Friday, July 8, 2011

Page 113: ICVSS2011 Selected Presentations

Sparse Inverse Problem Estimation

• Observation: y = U f + w, where w ∼ N(0, σ² Id).

• Sparse prior: the dictionary D = {φ_m}_{m∈Γ} provides a sparse representation for f,

    f = D a + ε_Λ,  with Λ = support(a), |Λ| ≪ |Γ| and ||ε_Λ||_2 ≪ ||f||_2.

• Then the transformed dictionary U D = {U φ_m}_{m∈Γ} provides a sparse representation for y,

    y = U D a + ε'_Λ,  with Λ = support(a), |Λ| ≪ |Γ| and ||ε'||_2 ≪ ||y||_2.

• Sparse inverse problem estimation: estimate a from y,

    â = argmin_a ||U D a − y||² + λ ||a||_1,   then   f̂ = D â.

Friday, July 8, 2011
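A minimal sketch of this sparse estimate with a random masking operator and a stand-in dictionary, solved with scikit-learn's Lasso (whose internal scaling of λ differs slightly from the formula above); everything below is illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    N, K = 64, 128                                 # signal size, number of atoms
    D = rng.normal(size=(N, K)); D /= np.linalg.norm(D, axis=0)   # stand-in dictionary
    a_true = np.zeros(K); a_true[rng.choice(K, 5, replace=False)] = rng.normal(size=5)
    f = D @ a_true                                 # sparse ground-truth signal

    keep = rng.random(N) < 0.5                     # U: random masking (inpainting-style)
    U = np.eye(N)[keep]
    y = U @ f + 0.01 * rng.normal(size=keep.sum())     # y = U f + w

    # a_hat = argmin ||U D a - y||^2 + lambda ||a||_1 ;  f_hat = D a_hat
    lasso = Lasso(alpha=0.01, max_iter=10000).fit(U @ D, y)
    f_hat = D @ lasso.coef_
    print("relative reconstruction error:", np.linalg.norm(f_hat - f) / np.linalg.norm(f))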

Page 114: ICVSS2011 Selected Presentations

Structured Representation and Estimation

Overcomplete dictionary D vs. structured overcomplete dictionary B_1, B_2, ..., B_K

• Dictionary: union of PCAs, D = {B_k}_{1≤k≤K}
• Union of orthogonal bases
• In each basis, the atoms are ordered by decreasing eigenvalue: λ^k_1 ≥ λ^k_2 ≥ ... ≥ λ^k_N

• Piecewise linear estimation (PLE)
  - a linear estimator per basis
  - non-linear basis selection: a best linear estimator is selected
  - small degree of freedom, fast computation, state-of-the-art performance

Friday, July 8, 2011

Page 115: ICVSS2011 Selected Presentations

Gaussian Mixture Models

Observation model: y_i = U_i f_i + w_i, where w_i ∼ N(0, σ² Id), for observations {y_i}_{1≤i≤I}.

• Estimate the Gaussians {(µ_k, Σ_k)}_{1≤k≤K}
• Identify the Gaussian k_i that generates f_i, ∀i
• Estimate f_i from y_i and N(µ_{k_i}, Σ_{k_i}), ∀i

Friday, July 8, 2011

Page 116: ICVSS2011 Selected Presentations

Structured Sparsity

• PCA (Principal Component Analysis): Σ_k = B_k S_k B_k^T, with S_k = diag(λ^k_1, ..., λ^k_N) and eigenvalues λ^k_1 ≥ λ^k_2 ≥ ... ≥ λ^k_N.
• PCA basis B_k = {φ^k_m}_{1≤m≤N}, orthogonal.
• PCA transform: f^k_i = B_k a^k_i.

• MAP with PCA:

    f̂^k_i = argmin_{f_i} ( ||U_i f_i − y_i||² + σ² f_i^T Σ_k^{-1} f_i )

  or, equivalently, in the coefficient domain,

    â^k_i = argmin_{a_i} ( ||U_i B_k a_i − y_i||² + σ² Σ_{m=1}^{N} |a_i[m]|² / λ^k_m )
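A minimal sketch of the per-basis linear MAP estimator and the non-linear basis selection, assuming zero-mean Gaussians as on this slide and selecting the basis by the smallest objective value (the log-determinant term of the full MAP-EM criterion is omitted); covariances, mask and noise level are stand-ins.

    import numpy as np

    def ple_estimate(y, U, Sigmas, sigma2):
        """For each model Sigma_k, compute f_k = (U^T U + sigma^2 Sigma_k^{-1})^{-1} U^T y
        (the closed form of the MAP objective above), then pick the best basis."""
        best = None
        for k, Sigma in enumerate(Sigmas):
            Sigma_inv = np.linalg.inv(Sigma)
            f_k = np.linalg.solve(U.T @ U + sigma2 * Sigma_inv, U.T @ y)
            obj = np.sum((U @ f_k - y) ** 2) + sigma2 * f_k @ Sigma_inv @ f_k
            if best is None or obj < best[0]:
                best = (obj, k, f_k)
        return best[1], best[2]                   # selected basis index, estimated patch

    # toy usage: two stand-in covariances, half of the pixels observed
    rng = np.random.default_rng(0)
    N = 16
    Sigmas = [np.diag(np.linspace(1.0, 0.01, N)),
              np.diag(np.linspace(0.01, 1.0, N))]
    f = rng.multivariate_normal(np.zeros(N), Sigmas[0])
    U = np.eye(N)[rng.permutation(N)[:8]]         # masking operator: keep 8 of 16 samples
    y = U @ f + 0.05 * rng.normal(size=8)
    k, f_hat = ple_estimate(y, U, Sigmas, sigma2=0.05 ** 2)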

Friday, July 8, 2011

Page 117: ICVSS2011 Selected Presentations

Structured Sparsity

Piecewise linear estimate

    â^k_i = argmin_{a_i} ( ||U_i B_k a_i − y_i||² + σ² Σ_{m=1}^{N} |a_i[m]|² / λ^k_m )

vs. sparse estimate

    â_i = argmin_{a_i} ( ||U D a_i − y_i||² + λ Σ_{m=1}^{|Γ|} |a_i[m]| )

• Non-linear basis selection over D = {B_1, ..., B_K}: degree of freedom K, compared with the full degree of freedom C(|Γ|, |Λ|) of unconstrained atom selection.
• Linear collaborative filtering in each basis.

Friday, July 8, 2011

Page 118: ICVSS2011 Selected Presentations

Initial Experiments: Evolution

24

Clustering 1st iteration

Clustering 2nd iteration

Friday, July 8, 2011

Page 119: ICVSS2011 Selected Presentations

Experiments: Inpainting

Original 20% available MCA 24.18 dB ASR 21.84 dB

KR 21.55 dB FOE 21.92 dB BP 25.54 dB PLE 27.65 dB

[Elad, Starck, Querre, Donoho, 05] [Guleryuz, 06]

[Takeda, Farsiu. Milanfar, 06] [Roth and Black, 09] [Zhou, Sapiro, Carin, 10]26

Friday, July 8, 2011

Page 120: ICVSS2011 Selected Presentations

Experiments: Zooming

Original Bicubic 28.47 dB SAI 30.32 dB PLE 30.64 dBSR 23.85 dB

Low-resolution

SAI [Zhang and Wu, 08]SR [Yang, Wright, Huang, Ma, 09]

29

Friday, July 8, 2011

Page 121: ICVSS2011 Selected Presentations

Experiments: Zooming Deblurring

f Uf y = SUf

Iy PLE 30.49 dB SR 28.93 dB29.40 dB

[Yang, Wright, Huang, Ma, 09]

32

Friday, July 8, 2011

Page 122: ICVSS2011 Selected Presentations

Experiments: Denoising

34

Original Noisy 22.10 dB NLmeans 28.42 dB

FOE 25.62 dB BM3D 30.97 dB PLE 31.00 dB

[Buades et al, 06]

[Roth and Black, 09] [Dabov et al, 07]

Friday, July 8, 2011

Page 123: ICVSS2011 Selected Presentations

Summary of this part

• Gaussian mixture models and MAP-EM work well for image inverse problems.

• Piecewise linear estimation, connection to structured sparsity.

• Nonlinear best basis selection, small degree of freedom.

• Faster computation than sparse estimation.

• Results in the same ballpark of the state-of-the-art.

• Beyond images: recommender systems and audio (Sprechmann & Cancela)

• Statistical compressed sensing38

• Collaborative linear filtering.

Friday, July 8, 2011

Page 124: ICVSS2011 Selected Presentations

Modeling and Learning Human Activity

Alexey Castrodad(1,2) and Guillermo Sapiro(2)
(1) NGA Basic and Applied Research
(2) University of Minnesota, ECE Department
[email protected], [email protected]

Page 125: ICVSS2011 Selected Presentations

Motivation

• Problem: given volumes of video feed, detect activities of interest
  - mostly done manually!
• Solving this will:
  - aid the operator: surveillance/security, gaming, psychological research
  - sift through large amounts of data
• Solution: fully/semi-automatic activity detection with minimum human interaction
  - invariance to spatial transformations
  - robust to occlusions, low resolution, noise
  - fast and accurate
  - simple, generic

Page 126: ICVSS2011 Selected Presentations

Sparse modeling: Dictionary learning from data

7  

Page 127: ICVSS2011 Selected Presentations

Sparse modeling for action classification: Phase 1

• Input videos
• Spatio-temporal features
• Sparse modeling (per-class dictionaries D1, D2, D3 → D)
• l1 pooling (A1, A2, A3)

[Block diagram: training (Class 1, Class 2, Class 3) → per-class dictionaries; new video → feature extraction → sparse coding → classification → classifier output.]

Page 128: ICVSS2011 Selected Presentations

Sparse modeling for action classification: Phase 2

• Sparse modeling (per-class dictionaries D1, D2, D3 → D)
• Inter-class modeling (dictionaries E1, E2, E3)
• l1 pooling (A1, A2, A3)

[Block diagram: training videos → feature extraction → sparse coding (training); new video → feature extraction → sparse coding → sparse coding against E1, E2, E3 → classification, combined with the classifier output from Phase 1.]

Page 129: ICVSS2011 Selected Presentations

Results

• YouTube Action Dataset
  - variable spatial resolution videos, 3-8 seconds each
  - 11 types of actions from YouTube videos

  Scene: indoors/outdoors
  Actions: basketball shooting, cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, walking with a dog
  Camera: jitter, scale variations, camera motion, variable illumination conditions, high background clutter
  Resolution: variable, resampled to 320 x 240
  Frame rate: 25 fps

Page 130: ICVSS2011 Selected Presentations

Results: YouTube Action Dataset

• Best/recent reported: 75.8% (Q.V. Le et al., 2011); 84.2% (Wang et al., 2011)
• Recognition rate: 80.29% (phase 1) and 91.9% (phase 2)

Page 131: ICVSS2011 Selected Presentations

Conclusion

• Main contribution:
  - robust activity recognition framework based on sparse modeling
  - generic: works on multiple data sources
  - state-of-the-art results on all of them, with the same parameters
• Key advantages:
  - simplicity, state-of-the-art results
  - fast and accurate: 7.5 fps
  - 7 frames needed for detection
• Future directions:
  - exploit human interactions
  - infer the actions
  - foreground extraction/video analysis for activity clustering

Page 132: ICVSS2011 Selected Presentations

ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg

Outline

1 ICVSS 2011

2 A Trillion Photos - Steven Seitz

3 Efficient Novel Class Recognition and Search - LorenzoTorresani

4 The Life of Structured Learned Dictionaries - Guillermo Sapiro

5 Image Rearrangement & Video Synopsis - Shmuel Peleg

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

Page 133: ICVSS2011 Selected Presentations

Shift-Map Image Editing

Yael Pritch, Eitam Kav-Venaki, Shmuel Peleg

The Hebrew University of Jerusalem

Page 134: ICVSS2011 Selected Presentations
Page 135: ICVSS2011 Selected Presentations

Retargeting (Avidan and Shamir SIGGRAPH’07, Wolf et al., ICCV’07, Wang et al., SIGASIA’08, Rubinstein et al., SIGGRAPH’08, Rubinstein et al.,SIGGRAPH’09)

Input

Geometrical Image Editing: Retargeting

Shift-Map Output

Page 136: ICVSS2011 Selected Presentations

Geometrical Image Editing:Inpainting

Inpainting (Criminisi et al. CVPR’03, Wexler et al. CVPR’04,Sun, et al. SIGGRAPH’05, Komodakis et al. CVPR’06, Hays and Efros, SIGGRAPH’07)

InputMask

Page 137: ICVSS2011 Selected Presentations

OutputMask

Geometrical Image Editing:Inpainting

Inpainting (Criminisi et al. CVPR’03, Wexler et al. CVPR’04,Sun, et al. SIGGRAPH’05, Komodakis et al. CVPR’06, Hays and Efros, SIGGRAPH’07)

Page 138: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

Page 139: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A

Page 140: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A B

Page 141: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A BC

Page 142: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

A B DCNo accuratesegmentation

required

Page 143: ICVSS2011 Selected Presentations

Shift-Map Composition

User Constraints

A B C D

No accuratesegmentation

required

Page 144: ICVSS2011 Selected Presentations

• Shift-Maps represent a mapping for each pixel in the output image into the input image

• The color of the output pixel is copied from corresponding input pixel

Our Approach : Shift-Map

Output : R(u,v) Input : I(x,y)

Page 145: ICVSS2011 Selected Presentations

• We use relative mapping coordinate (like in Optical Flow)

Our Approach : Shift-Map

(u,v)

(x,y)

Output : R(u,v) Input : I(x,y)

(u,v)

Page 146: ICVSS2011 Selected Presentations

Our Approach: Shift-Map

[Diagram: input image, shift-map (horizontal and vertical shifts, e.g. Tx = 0, Tx = 50, Tx = 400, Ty = 10) and output image.]

• Minimal distortion
• Adaptive boundaries
• Fast optimization

Page 147: ICVSS2011 Selected Presentations

Geometric Editing as an Energy Minimization

• We look for the optimal mapping, which can be described as an energy minimization problem:
  - data term (computed for each pixel): external editing requirements
  - smoothness term (computed for each pair of neighboring pixels): avoid stitching artifacts

• Unified representation for geometric editing applications
• Solved using a graph labeling algorithm
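Written out, the energy described above has the following generic form (a schematic sketch consistent with the slide; the weighting factor α and the neighborhood notation N are assumptions, and the concrete smoothness term of the paper appears on the next slide):

    E(M) = \sum_{p \in R} E_d\big(M(p)\big) \;+\; \alpha \sum_{(p,q) \in \mathcal{N}} E_s\big(M(p), M(q)\big)

where M(p) = (t_x, t_y) is the shift assigned to output pixel p, E_d encodes the external editing requirements, and E_s penalizes stitching artifacts between neighboring output pixels.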

Page 148: ICVSS2011 Selected Presentations

The Smoothness Term

[Diagram: a discontinuity in the shift-map between neighboring output pixels p and q is penalized by comparing the colors and gradients of the corresponding input pixels p', q' and their neighbors n_p', n_q' (R: output image, I: input image).]

(Kwatra et al. 03, Agarwala et al. 04)

Page 149: ICVSS2011 Selected Presentations

• Data term varies between different application• Inpainting data term uses data mask D(x,y) over the

input image– D(x,y)= ∞ for pixels to be removed– D(x,y)=0 elsewhere

• Specific input pixels can be forced not to be included in the output image by setting D(x,y)=∞

The Data Term: Inpainting

(x,y)

0=D

(u,v)

Page 150: ICVSS2011 Selected Presentations

• Input pixels can be forced to appear in a new location

(u,v)

(x,y)

• Appropriate shift gets infinitely low energy

• Other shifts getinfinitely high energy

The Data Term: Rearrangement

Page 151: ICVSS2011 Selected Presentations

• Use picture borders• Can incorporate importance mask

– Order constraint on mapping is applied to prevent duplications of important areas

The Data Term: Retargeting

Page 152: ICVSS2011 Selected Presentations

• Minimal energy mapping can be represented as graph labeling where the Shift-Map value is the selected labelfor each output pixel

• Labels: relative shift

Shift-Map as Graph Labeling

Output image pixels Input image

Shift Map:assign

a label to each pixel

Nodes:pixels

Labels: shift-map values (tx,ty)

Page 153: ICVSS2011 Selected Presentations

Hierarchical SolutionGaussian pyramid

on input

Shift-Map

Shift-Map

Output

Page 154: ICVSS2011 Selected Presentations

Shift-Map handles, without additional user interaction, some cases that other algorithms have suggested can only be handled with additional user guidance.

Results and Comparison

J. Sun, L. Yuan, J. Jia, and H. Shum. Image completion with structure propagation. In SIGGRAPH’05

Shift-Map

Image completion with structure propagation [Sun et al. SIGGRAPH’05]

Mask

Page 155: ICVSS2011 Selected Presentations

Application: Retargeting

Input Output

Page 156: ICVSS2011 Selected Presentations

Results and Comparison

Non-Homogeneous [Wolf et al., ICCV'07], PatchMatch [Barnes et al., SIGGRAPH'09], Improved Seam Carving [Rubinstein et al., SIGGRAPH'08], Shift-Maps

Page 157: ICVSS2011 Selected Presentations

Summary

• New representation to geometrical editing applications as an optimal graph labeling

• Unified approach

• Solved efficiently using hierarchical approximations

• Minimal user interaction is required for various editing tasks

Page 158: ICVSS2011 Selected Presentations

• Build an Output image R from pixels taken from Source image I such that R is most similar to Target image T

Similarity Guided Composition

Source ImageTarget Image Output

Page 159: ICVSS2011 Selected Presentations

• Data term reflects a similarity between the output image R and a target image T

• Similarity uses both colors and gradients

Similarity Guided Composition

Page 160: ICVSS2011 Selected Presentations

• Data term indicates the similarity of the output image to the target image

• Weight between similarity and smoothness has the following effect

Source ImageTarget Image

ResultedOutput

Previous Work: Efros and Freeman 2001, Hertzman et al. 2001

Similarity Guided Composition

Page 161: ICVSS2011 Selected Presentations

Edge Preserving Magnification

Using the original image as the source, similarity guided composition can magnify

Does not work for gradual color changes

Source Target (bilinear magnification)

Result

Page 162: ICVSS2011 Selected Presentations

Edge Preserving Magnification

Original image can be the source for edge areas. Otherwise the magnified image is the source.

Source 1Magnified Target

Source 2Original Edge Map

Page 163: ICVSS2011 Selected Presentations

Edge Preserving Magnification

Bicubic Shift Map

Page 164: ICVSS2011 Selected Presentations

The Bidirectional Similarity [Simakov, Caspi, Shechtman, Irani – CVPR'2008]

Is it easy to compose (recover) the source from the target, and easy to compose (recover) the target from the source?

• Completeness: all source patches (at multiple scales) should be in the target
• Coherence: all target patches (at multiple scales) should be in the source

Page 165: ICVSS2011 Selected Presentations

• It will be hard to reconstruct the fish when composing the source back from the retargeted result
• Shift-Map retargeting maximizes the coherence

Shift-Map Retargeting with Feedback

Page 166: ICVSS2011 Selected Presentations

• Increase the Appearance Data Term of input regions with a high Composition Score E<A|B> and recompute the output B.

• Pixels with the higher Appearance Term will now appear in the output and increase the completeness.

Shift-Map Retargeting with Feedback

Page 167: ICVSS2011 Selected Presentations

Original

Retargeted Reconstruction of Original

Appearance Term

E<A|B>E<A|B>

Page 168: ICVSS2011 Selected Presentations

FeedbackOriginal Shift-Map

Shift-Map Retargeting with Feedback

Page 169: ICVSS2011 Selected Presentations

Video Synopsis and Indexing: Making a Long Video Short

• 11 million cameras in 2008• Expected 30 million in 2013• Recording 24 hours a day, every day

Page 170: ICVSS2011 Selected Presentations

t

Video Synopsis: Shift Objects in Time

Input video I(x,y,t)  →  Synopsis video S(x,y,t)

Page 171: ICVSS2011 Selected Presentations

Steps in Video Synopsis

• Detect and track objects, store in a database.
• Select relevant objects from the database.
• Display selected objects in a very short "Video Synopsis".
• In the "Video Synopsis", objects from different times can appear simultaneously.
• Index from selected objects into the original video.
• Cluster similar objects.

Page 172: ICVSS2011 Selected Presentations

Two ClustersCars

People

Camera in St. Petersburg

• Detect specific events• Discover activity patterns

Page 173: ICVSS2011 Selected Presentations

ICVSS 2011 Presentations

168.176.61.22/comp/buzones/PROCEEDINGS/ICVSS2011

Jiri Matas - Tracking, Learning, Detection, Modeling

Ivan Laptev - Human Action Recognition

Josef Sivic - Large Scale Visual Search

Andrew Fitzgibbon - Computer Vision: Truth and Beauty(Kinect)

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations

Page 174: ICVSS2011 Selected Presentations

The end...

Thanks !

Angel Cruz-Roa [email protected] Rueda-Olarte [email protected]

Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations