This presentation shares experiences and selected talks from the International Computer Vision Summer School (ICVSS 2011), attended by Angel Cruz and Andrea Rueda from the BioIngenium Research Group of Universidad Nacional de Colombia.
ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg
ICVSS 2011: Selected Presentations
Angel Cruz and Andrea Rueda
BioIngenium Research Group, Universidad Nacional de Colombia
August 25, 2011
Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
Outline
1 ICVSS 2011
2 A Trillion Photos - Steven Seitz
3 Efficient Novel Class Recognition and Search - Lorenzo Torresani
4 The Life of Structured Learned Dictionaries - Guillermo Sapiro
5 Image Rearrangement & Video Synopsis - Shmuel Peleg
ICVSS 2011: International Computer Vision Summer School
15 speakers, from the USA, France, the UK, Italy, the Czech Republic and Israel
A Trillion Photos
Steve Seitz, University of Washington
Sicily Computer Vision Summer School, July 11, 2011
>3 billion uploaded each month
~ trillion photos taken each year
What do you do with a trillion photos?
Digital Shoebox (hard drives, iPhoto, Facebook, ...)
Comparing images
Detect features using SIFT [Lowe, IJCV 2004]
Extraordinarily robust image matching:
– across viewpoint (~60-degree out-of-plane rotations)
– varying illumination
– real-time implementations
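The matching step behind these results pairs descriptors only when the nearest neighbour is clearly closer than the second nearest (Lowe's ratio test). A minimal pure-Python sketch on toy 2-D "descriptors" (all names and data here are illustrative, not from the talk):

```python
import math

def match_features(desc_a, desc_b, ratio=0.8):
    """Match descriptors from image A to image B using Lowe's ratio test:
    accept a match only if the nearest neighbour is clearly closer than
    the second-nearest (distance ratio below `ratio`)."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted(
            (math.dist(da, db), j) for j, db in enumerate(desc_b)
        )
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

# Toy 2-D "descriptors": A[0] has a clear match in B, A[1] is ambiguous.
A = [(0.0, 0.0), (5.0, 5.0)]
B = [(0.1, 0.0), (9.0, 9.0), (5.0, 5.1), (5.1, 5.0)]
print(match_features(A, B))  # [(0, 0)]: the ambiguous feature is rejected
```

Real SIFT descriptors are 128-dimensional, but the ratio-test logic is identical.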
Scale Invariant Feature Transform (adapted from a slide by David Lowe)
[Diagram: image edges accumulated into an angle histogram over 0 to 2π]
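The angle histogram in the slide can be sketched in a few lines: accumulate gradient orientations over [0, 2π), weighted by gradient magnitude. This is a toy stand-in for SIFT's orientation assignment (no Gaussian weighting or peak interpolation), with an invented test patch:

```python
import math

def orientation_histogram(patch, bins=36):
    """Histogram of gradient orientations over [0, 2pi) for a grayscale
    patch (list of rows), weighted by gradient magnitude: the statistic
    SIFT uses to assign a keypoint its dominant orientation."""
    h = [0.0] * bins
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            ang = math.atan2(dy, dx) % (2 * math.pi)
            h[int(ang / (2 * math.pi) * bins) % bins] += mag
    return h

# A patch whose intensity increases left-to-right: every gradient points
# along +x, so all the mass should land in the first (angle ~ 0) bin.
patch = [[float(x) for x in range(5)] for _ in range(5)]
hist = orientation_histogram(patch)
print(hist.index(max(hist)))  # 0
```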
NASA Mars Rover images, with SIFT feature matches (figure by Noah Snavely)
[Photo clusters: St. Peters (inside), St. Peters (outside), Trevi Fountain, Il Vittoriano, Coliseum (inside), Coliseum (outside), Forum]
Structure from motion
Matched photos → 3D structure
[Diagram: three cameras with poses (R1, t1), (R2, t2), (R3, t3) observing scene points p1, ..., p7; estimated by solving min f(R, T, P)]
aka “bundle adjustment” (texts: Zisserman; Faugeras)
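The bundle-adjustment objective f(R, T, P) is a sum of squared reprojection errors over all camera-point observations. A minimal sketch, assuming a bare unit-focal pinhole projection (the cameras, points, and observations below are invented for illustration):

```python
def reprojection_error(cameras, points, observations):
    """Bundle-adjustment objective f(R, T, P): the sum of squared
    reprojection errors over all (camera, point) observations.
    Each camera is (R, t) with R a 3x3 rotation (list of rows) and t a
    3-vector; projection here is a bare pinhole x/z, y/z (unit focal
    length), a sketch rather than a full calibrated model."""
    err = 0.0
    for (ci, pi), (u, v) in observations.items():
        R, t = cameras[ci]
        X = points[pi]
        # Transform the world point into the camera frame: Xc = R @ X + t
        Xc = [sum(R[r][k] * X[k] for k in range(3)) + t[r] for r in range(3)]
        du = Xc[0] / Xc[2] - u
        dv = Xc[1] / Xc[2] - v
        err += du * du + dv * dv
    return err

# One identity camera looking down +z at a point that projects to (0.5, 0.25).
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cams = [(I, [0.0, 0.0, 0.0])]
pts = [[1.0, 0.5, 2.0]]
obs = {(0, 0): (0.5, 0.25)}
print(reprojection_error(cams, pts, obs))  # 0.0
```

Bundle adjustment minimizes this error jointly over all camera poses and 3-D points, typically with sparse Levenberg-Marquardt.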
Reconstructing Rome in a day...
From ~1M images, using ~1000 cores
Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz
http://grail.cs.washington.edu/rome
Rome 150K: Colosseum
Rome: St. Peters
Venice (250K images)
Venice: Canal
Dubrovnik
Sparse output from the SfM system
From Sparse to Dense
Furukawa, Curless, Seitz, Szeliski, CVPR 2010
Most of our photos don’t look like this
recognition + alignment
Your Life in 30 Seconds
path optimization
Picasa integration: as the “Face Movies” feature in v3.8 (Rahul Garg, Ira Kemelmacher)
Conclusion
trillions of photos + computer vision breakthroughs
= new ways to see the world
Efficient Novel-Class Recognition and Search
Lorenzo Torresani
Problem statement: novel object-class search
• Given: user-provided images of an object class + an image database (e.g., 1 million photos)
– no text/tags available
– query images may represent a novel class
• Want: the database images of this class
Application: Web-powered visual search in unlabeled personal photos
Goal: find “soccer camp” pictures on my computer
1 Search the Web for images of “soccer camp”
2 Find images of this visual class on my computer
Application: product search
• Search for aesthetic products
Relation to other tasks: novel class search vs. image retrieval and object categorization
Compared to image retrieval, the analogies are large databases, efficient indexing, and compact representation; the difference is that retrieval uses simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout).
Figure 5. The retrieval performance is evaluated using a large ground-truth database (6376 images) with groups of four images known to be taken of the same object, but under different conditions. Each image in turn is used as query image, and the three remaining images from its group should ideally be at the top of the query result. In order to compare against less efficient non-hierarchical schemes we also use a subset of the database consisting of around 1400 images.

[...] settings with a 1400-image subset of the test images. The curves show the distribution of how far the wanted images drop in the query rankings. The points where a larger number of methods meet the y-axis are given in Table 1. Note especially that the use of a larger vocabulary and also the L1 norm gives performance improvements over the settings used by [17].

Figure 6. Curves showing the percentage (y-axis) of the ground-truth query images that make it into the top x percent (x-axis) of the query frames for a 1400-image database. The curves are shown up to 5% of the database size. As discussed in the text, it is crucial for scalable retrieval that the correct images from the database make it to the very top of the query, since verification is feasible only for a tiny fraction of the database when the database grows large. Hence, we are mainly interested in where the curves meet the y-axis. To avoid clutter, this number is given in Table 1 for a larger number of settings. A number of conclusions can be drawn from these results: a larger vocabulary improves retrieval performance; the L1 norm gives better retrieval performance than the L2 norm; entropy weighting is important, at least for smaller vocabularies. Our best setting is method A, which gives much better performance than the setting used by [17], which is setting T.

The performance with various settings was also tested on the full 6376-image database. It is important to note that the scores decrease with increasing database size, as there are more images to confuse with. The effect of the shape of the vocabulary tree is shown in Figure 7. The effects of defining the vocabulary tree with varying amounts of data and training cycles are investigated in Figure 8.

Figure 10 shows a snapshot of a demonstration of the method, running in real time on a 40000-image database of CD covers, some connected to music. We have so far tested the method with a database size as high as 1 million images, more than one order of magnitude larger than any other work we are aware of, at least in this category of method. The results are shown in Figure 9. As we could not obtain ground truth for that size of database, the 6376-image ground-truth set was embedded in a database that also contains several movies: The Bourne Identity, The Matrix, Braveheart, Collateral, Resident Evil, Almost Famous and Monsters Inc. Note that all frames from the movies are in [...]
[Figure: query images alongside their retrieved results]
from [Nister and Stewenius, ’07]
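The scoring scheme in the excerpt (entropy/idf weighting plus L1-normalised bag-of-visual-words vectors compared under the L1 norm) can be sketched without the vocabulary tree itself, which only accelerates the visual-word assignment. A toy version with invented word lists:

```python
import math
from collections import Counter

def build_index(db_words):
    """db_words: one list of visual words per database image. Returns the
    idf-weighted, L1-normalised histogram of each image ('entropy
    weighting' in Nister and Stewenius' terms) plus the embedding fn."""
    n = len(db_words)
    df = Counter(w for ws in db_words for w in set(ws))
    idf = {w: math.log(n / c) for w, c in df.items()}
    def embed(words):
        h = Counter(words)
        vec = {w: h[w] * idf.get(w, 0.0) for w in h}
        norm = sum(abs(v) for v in vec.values()) or 1.0
        return {w: v / norm for w, v in vec.items()}
    return [embed(ws) for ws in db_words], embed

def rank(query_vec, db_vecs):
    """Rank database images by L1 distance between normalised vectors."""
    def l1(a, b):
        return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))
    return sorted(range(len(db_vecs)), key=lambda i: l1(query_vec, db_vecs[i]))

db = [["arch", "dome", "dome"], ["car", "road"], ["dome", "arch", "column"]]
vecs, embed = build_index(db)
print(rank(embed(["dome", "arch"]), vecs))  # [0, 2, 1]
```

A production system inverts this index (word → image list) so only images sharing words with the query are touched; the ranking is the same.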
[Figure residue: nearest-neighbour examples with predicted scene labels (wall, floor, poster, ceiling, door; road, sky, building, car, tree, bed, flower, sidewalk, crosswalk, mountain) and RBM pixel-labelling accuracies between 47% and 78%, comparing 32-bit RBM codes with the 16384-bit Gist descriptor.]

Figure 6. This figure shows six example input images. For each image, we show the first 12 nearest neighbors when using ground-truth semantic distance (see text), using a 32-bit RBM and the original Gist descriptor (which uses 16384 bits). Below each set of neighbors we show the LabelMe segmentations of each image. Those segmentations and their corresponding labels are used by a pixel-wise voting scheme to propose a segmentation and labeling of the input image. The resulting segmentation is shown below each input image. The number above the segmentation indicates the percentage of pixels correctly labeled. A more quantitative analysis is shown in Fig. 7.
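The appeal of 32-bit codes is that neighbour search collapses to an XOR plus a popcount per database image, so even millions of codes fit in memory and scan quickly. A sketch of Hamming-distance retrieval (the codes below are arbitrary examples, not real RBM outputs):

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as ints."""
    return bin(a ^ b).count("1")

def nearest(query_code, db_codes, k=3):
    """Return indices of the k database codes closest in Hamming
    distance. With 32-bit codes the whole database fits in memory and
    each comparison is a single XOR + popcount."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return order[:k]

db = [0b10110010, 0b10110011, 0b01001101, 0b11110000]
print(nearest(0b10110010, db, k=2))  # [0, 1]: exact match, then a 1-bit flip
```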
from [Torralba et al., ’08]
Figure 4. Examples of searching the 5K dataset for: (a) All Soul's College. (b) Bridge of Sighs, Hertford College. (c) Ashmolean Museum. (d) Bodleian window. The query is shown on the left, with selected top-ranked retrieved images shown to the right. All results displayed are returned before the first false positive for each query.
from [Philbin et al., ’07]
Relation to other tasks
Compared to image retrieval, the analogies are large databases, efficient indexing, and compact representation; the difference is that retrieval uses simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout).
Compared to object classification, the analogy is recognition of object classes from a few examples; the differences are that in classification the classes to recognize are defined a priori, training and recognition time is unimportant, and storage of features is not an issue.
Technical requirements of novel-class search
• The object classifier must be learned on the fly from few examples
• Recognition in the database must have low computational cost
• Image descriptors must be compact to allow storage in memory
State of the art in object classification
Winning recipe: many features + non-linear classifiers(e.g. [Gehler and Nowozin, CVPR’09])
[Figure: images described by multiple features, separated by a non-linear decision boundary]
Model evaluation on Caltech256
[Plot: classification accuracy (%) vs. number of training examples (0 to 30), one curve per individual feature: gist, phog, phog2pi, ssim, bow5000. Annotation: linear models, individual features.]
Model evaluation on Caltech256
[Plot: accuracy (%) vs. number of training examples, adding a curve for the linear combination of features. Annotations: linear models, individual features; linear models, feature combination; non-linear models, feature combination (a.k.a. Multiple Kernel Learning) [Gehler & Nowozin ’09].]
Model evaluation on Caltech256
[Plot: accuracy (%) vs. number of training examples, now also with a curve for the non-linear combination of features, which performs best.]
Multiple kernel combiners
Classification output is obtained by combining many features via non-linear kernels:

h(x) = Σ_{f=1..F} βf ( Σ_{n=1..N} kf(x, xn) αn + b )

(the outer sum runs over features, the inner sum over training examples)
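Evaluating this combiner is just a double loop: over features and over training examples. A toy sketch with two hypothetical kernels on scalar inputs (kernels, coefficients, and data are all invented for illustration):

```python
import math

def combined_score(x, train, alphas, b, betas, kernels):
    """Evaluate the multiple-kernel combiner
        h(x) = sum_f beta_f * ( sum_n k_f(x, x_n) * alpha_n + b )
    with one kernel function per feature channel."""
    return sum(
        beta * (sum(k(x, xn) * a for xn, a in zip(train, alphas)) + b)
        for beta, k in zip(betas, kernels)
    )

# Toy setup: two feature channels (a linear kernel and an RBF kernel)
# and two training examples with signed dual coefficients.
lin = lambda u, v: u * v
rbf = lambda u, v: math.exp(-(u - v) ** 2)
train = [1.0, -1.0]
alphas = [0.5, -0.5]  # dual coefficients, signed by the training labels
score = combined_score(2.0, train, alphas, 0.0, [0.6, 0.4], [lin, rbf])
print(score > 0)  # True: the positive training example dominates
```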
where each kernel kf compares images through a different feature channel.
Multiple kernel learning (MKL)
Learn jointly: 1. a linear combination of kernels, and 2. the SVM parameters.
A typical example of such a feature fm would be a bag-of-visual-words histogram of the image. Then, the corresponding dimensionality dm would be the codebook size used for the vector quantization step. In the following, we will use the name feature combination method for all methods which address the feature combination problem.

Kernel methods. The object classification problem is a special case of multiclass classification. In computer vision the problem of learning a multiclass classifier from training data is often addressed by means of kernel methods. Kernel methods make use of kernel functions defining a measure of similarity between pairs of instances. In the context of feature combination it is useful to associate a kernel to each image feature as follows. For a kernel function k between real vectors we define the short-hand notation

km(x, x′) = k(fm(x), fm(x′)),

such that the image kernel km : X × X → R only considers similarity with respect to image feature fm. If the image feature is specific to a certain aspect, say, it only considers texture information, then the kernel measures similarity only with regard to this aspect. The subscript m of the kernel can then be understood as indexing into the set of features.

In the following, for notational convenience, we will denote the kernel response of the m’th feature for a given sample x ∈ X to all training samples xi, i = 1, ..., N as Km(x) ∈ Rᴺ with

Km(x) = [km(x, x1), km(x, x2), ..., km(x, xN)]ᵀ.

In case x is the i’th training sample, i.e. x = xi, then Km(x) is simply the i’th column of the m’th kernel matrix.

Feature selection as kernel selection. In this paper we study a class of kernel classifiers that aim to combine several kernels into a single model. Since we associate image features with kernel functions, kernel combination/selection translates naturally into feature combination/selection.

A conceptually simple approach is the use of Cross Validation (CV) to select a single kernel from the set {k1, ..., kF}. Every feature combination method should be able to outperform this baseline method or at least match its performance if a single feature is sufficient for good classification.

In the following we will present several methods in a unified setting along with their training procedures. An overview of the different methods in their multiclass variant can also be found in Table 1.
3. Methods: Baselines
We include two simple baseline methods, both of which combine kernels in a pre-defined deterministic way and subsequently use the resulting kernel for SVM training.

3.1. Averaging Kernels
Arguably the simplest method to combine several kernels is to average them. We define the kernel function k*(x, x′) = (1/F) Σ_{m=1..F} km(x, x′), which is subsequently used in a support vector machine (SVM).

Training: The only free parameters are the SVM parameters. We use CV to estimate the best regularization constant. A multiclass variant is built using a one-versus-all scheme.

3.2. Product Kernels
The next baseline method we consider is to combine several kernels by multiplication. In this case we use k*(x, x′) = ( Π_{m=1..F} km(x, x′) )^(1/F) as the single kernel in an SVM.

Training: Same as for averaging.
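Both baselines can be written directly from their formulas. A sketch assuming non-negative base kernels (here two RBF kernels with different bandwidths standing in for different image features):

```python
import math

def average_kernel(kernels):
    """Baseline 1: the mean of F base kernels, k*(x,x') = (1/F) sum_m k_m."""
    return lambda x, y: sum(k(x, y) for k in kernels) / len(kernels)

def product_kernel(kernels):
    """Baseline 2: the geometric mean, k*(x,x') = (prod_m k_m)^(1/F).
    Assumes non-negative kernel values (true for e.g. RBF kernels)."""
    def k_star(x, y):
        p = 1.0
        for k in kernels:
            p *= k(x, y)
        return p ** (1.0 / len(kernels))
    return k_star

rbf1 = lambda x, y: math.exp(-(x - y) ** 2)        # "feature 1"
rbf2 = lambda x, y: math.exp(-0.1 * (x - y) ** 2)  # "feature 2"
k_avg = average_kernel([rbf1, rbf2])
k_prod = product_kernel([rbf1, rbf2])
print(round(k_avg(0.0, 1.0), 4), round(k_prod(0.0, 1.0), 4))
```

Either combined kernel then trains an ordinary SVM; the kernel weights are fixed rather than learned, which is exactly what MKL improves on.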
4. Methods: Multiple Kernel LearningAnother approach to perform kernel selection is to learn
a kernel combination during the training phase of the al-gorithm. One prominent instance of this class is MKL. Itsobjective is to optimize jointly over a linear combination ofkernels k∗(x, x�) =
�Fm=1 βmkm(x, x�) and the parame-
ters α ∈ RN and b ∈ R of an SVM.MKL was originally introduced in [1]. For efficiency and
in order to obtain sparse, interpretable coefficients, it re-stricts βm ≥ 0 and imposes the constraint
�Fm=1 βm = 1.
Since the scope of this paper is to access the applicabilityof MKL to feature combination rather than its optimizationpart we opted to present the MKL formulations in a way al-lowing for easier comparison with the other methods. Wewrite its objective function as
$$\min_{\alpha,\beta,b} \;\; \frac{1}{2} \sum_{m=1}^{F} \beta_m \alpha^T K_m \alpha \;+\; C \sum_{i=1}^{N} L\Big( y_i,\; b + \sum_{m=1}^{F} \beta_m K_m(x_i)^T \alpha \Big) \qquad (1)$$

$$\text{s.t.} \quad \sum_{m=1}^{F} \beta_m = 1, \qquad \beta_m \geq 0, \quad m = 1, \dots, F,$$
where $L(y, t) = \max(0, 1 - yt)$ denotes the hinge loss. We compare two different algorithms solving this problem for their runtime performance, namely SILP [18]² and SimpleMKL [17]³.
The final binary decision function of MKL has the following form:

$$F_{\mathrm{MKL}}(x) = \operatorname{sign}\Big( \sum_{m=1}^{F} \beta_m \big( K_m(x)^T \alpha + b \big) \Big). \qquad (2)$$

²Available online: www.shogun-toolbox.org/
³Available online: mloss.org/software/view/174/
Multiple Kernel Learning: learning a non-linear SVM by jointly optimizing over $\alpha$, $b$ and the kernel combination weights $\beta$ [Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]:

$$k^*(x, x') = \sum_{f=1}^{F} \beta_f\, k_f(x, x')$$

$$\min_{\alpha,\beta,b} \;\; \frac{1}{2} \sum_{f=1}^{F} \beta_f \alpha^T K_f \alpha + C \sum_{n=1}^{N} L\Big( y_n,\; b + \sum_{f=1}^{F} \beta_f K_f(x_n)^T \alpha \Big)$$

$$\text{subject to} \quad \sum_{f=1}^{F} \beta_f = 1, \qquad \beta_f \geq 0, \quad f = 1, \dots, F$$

where $L(y, t) = \max(0, 1 - yt)$ and $K_f(x) = [k_f(x, x_1), k_f(x, x_2), \dots, k_f(x, x_N)]^T$.
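Once $\beta$, $\alpha$ and $b$ are fixed, evaluating the combined model is just a weighted sum of kernels. A toy numpy sketch (all numbers are hypothetical stand-ins for learned parameters):

```python
import numpy as np

# Two precomputed kernel matrices over the training points (toy values).
K1 = np.array([[1.0, 0.2],
               [0.2, 1.0]])
K2 = np.array([[1.0, 0.8],
               [0.8, 1.0]])
beta = np.array([0.7, 0.3])     # simplex weights: non-negative, sum to 1
alpha = np.array([0.5, -0.5])   # signed SVM dual coefficients
b = 0.1

# k*(x, x') = sum_f beta_f k_f(x, x'); score each training point.
K_star = beta[0] * K1 + beta[1] * K2
f = K_star @ alpha + b          # decision values
pred = np.sign(f)               # the sign gives the binary prediction
```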
LP-β: a two-stage approach to MKL [Gehler and Nowozin, 2009]

Classification output of traditional MKL:

$$h_{\mathrm{MKL}}(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_n + b \Big)$$

Classification function of LP-β:

$$h(x) = \sum_{f=1}^{F} \beta_f \underbrace{\Big( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \Big)}_{h_f(x)}$$

Two-stage training procedure:
1. train each $h_f(x)$ independently → traditional SVM learning
2. optimize over $\beta$ → a simple linear program
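A toy sketch of stage 2, assuming stage 1 already produced per-feature decision values $h_f(x)$ on a validation set (the data here is synthetic; with F = 2 the simplex constraint reduces the linear program to a 1-D search):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.where(rng.random(200) < 0.5, 1.0, -1.0)   # validation labels
h1 = y + 0.3 * rng.standard_normal(200)          # informative feature channel
h2 = rng.standard_normal(200)                    # uninformative channel

def mean_hinge(beta):
    """Mean hinge loss of the combination h = beta*h1 + (1-beta)*h2."""
    h = beta * h1 + (1.0 - beta) * h2
    return np.mean(np.maximum(0.0, 1.0 - y * h))

# Stage 2: pick beta on the simplex minimizing the validation hinge loss.
betas = np.linspace(0.0, 1.0, 101)
best_beta = betas[np.argmin([mean_hinge(b) for b in betas])]
```

As expected, nearly all weight lands on the informative channel.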
LP-β for novel-class search?

The LP-β classifier

$$h(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \Big)$$

combines a sum over features (f) with a sum over training examples (n). It is unsuitable for our needs due to:
• large storage requirements (typically over 20K bytes/image)
• costly evaluation (requires query-time kernel-distance computation for each test image)
• costly training (1+ minute for O(10) training examples)
Classemes: a compact descriptor for efficient recognition [Torresani et al., 2010]

Key idea: represent each image in terms of its "closeness" to a set of basis classes ("classemes"). The c-th entry of the descriptor is the output of a pre-learned LP-β classifier for the c-th basis class:

$$\phi_c(x) = h_{\mathrm{classeme}_c}(x) = \sum_{f=1}^{F} \beta^c_f \sum_{n=1}^{N} k_f(x, x^c_n)\,\alpha^c_n + b^c$$

$$\Phi(x) = [\phi_1(x), \dots, \phi_C(x)]^T$$

Query-time learning: train a linear classifier on $\Phi(x)$ from the descriptors $\Phi(x_1), \dots, \Phi(x_N)$ of the training examples of the novel class:

$$g_{\mathrm{duck}}(\Phi(x); w_{\mathrm{duck}}) = \Phi(x)^T w_{\mathrm{duck}} = \sum_{c=1}^{C} w^{\mathrm{duck}}_c \underbrace{\Big( \sum_{f=1}^{F} \beta^c_f \sum_{n=1}^{N} k_f(x, x^c_n)\,\alpha^c_n + b^c \Big)}_{\text{LP-}\beta\text{, trained before the creation of the database}}$$

Only the weights $w_{\mathrm{duck}}$ are trained at query time.

How this works:
• Accurate semantic labels are not required.
• Classeme classifiers are just used as detectors for specific patterns of texture, color, shape, etc.
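A toy numpy sketch of the pipeline: the pre-learned classeme classifiers are stubbed here as random linear scorers (in the real system each is an LP-β kernel combiner), and the query-time classifier is a regularized least-squares fit on Φ:

```python
import numpy as np

rng = np.random.default_rng(1)
C, D = 50, 20                             # toy sizes: classemes, raw feature dim

# Stand-ins for the pre-learned classeme classifiers (random linear scorers).
W_classemes = rng.standard_normal((C, D))

def Phi(X):
    """Classeme descriptors: each row is the vector of basis-classifier outputs."""
    return X @ W_classemes.T

# Query time: a handful of examples of a novel class, plus negatives (toy data).
pos = rng.standard_normal((10, D)) + 1.0
neg = rng.standard_normal((10, D)) - 1.0
X = Phi(np.vstack([pos, neg]))            # (20, C) descriptor matrix
y = np.array([1.0] * 10 + [-1.0] * 10)

# Cheap query-time linear classifier on Phi: ridge regression to the labels.
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(C), X.T @ y)
train_acc = np.mean(np.sign(X @ w) == y)
```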
Efficient Object Category Recognition Using Classemes

Table 1. Highly weighted classemes. Five classemes with the highest LP-β weights for the retrieval experiment, for a selection of Caltech 256 categories. Some may appear to make semantic sense, but it should be emphasized that our goal is simply to create a useful feature vector, not to assign semantic labels. The somewhat peculiar classeme labels reflect the ontology used as a source of base categories.
Large-scale recognition benefits from a compact descriptor for each image, for example allowing databases to be stored in memory rather than on disk. The descriptor we propose is two orders of magnitude more compact than the state of the art, at the cost of a small drop in accuracy. In particular, performance of the state of the art with 15 training examples is comparable to our most compact descriptor with 30 training examples.

The ideal descriptor also provides good results with simple classifiers, such as linear SVMs, decision trees, or tf-idf, as these can be implemented to run efficiently on large databases.

Although a number of systems satisfy these desiderata for object-instance or place recognition [18,9] or for whole-scene recognition [26], we argue that no existing system has addressed these requirements in the context of object category recognition.

The system we propose is a form of classifier combination: the components of the proposed descriptor are the outputs of a set of predefined category-specific classifiers applied to the image. The obvious (but only partially correct) intuition is that a novel category, say duck, will be expressed in terms of the outputs of base classifiers (which we call "classemes"), describing either objects similar to ducks, or objects seen in conjunction with ducks. Because these base classifier outputs provide a rich coding of the image, simple classifiers such as linear SVMs can approach state-of-the-art accuracy, satisfying the requirements listed above. However, the reason this descriptor will work is slightly more subtle. It is not required or expected that these base categories will provide useful semantic labels, of the form water, sky, grass, beak. On the contrary, we work on the assumption that modern category recognizers are essentially quite dumb; so a swimmer recognizer looks mainly for water texture, and the bomber plane recognizer contains some tuning for "C" shapes corresponding to the airplane nose, and perhaps the "V" shapes at the wing and tail. Even if these recognizers are perhaps overspecialized for recognition of their nominal category, they can still provide useful building blocks to the learning algorithm that learns to recognize
Related work
• Attribute-based recognition:
Figure 4: Attribute prediction for across-category protocols. On the left is the leave-one-class-out case for Pascal and on the right is attribute prediction for the Yahoo set. Only attributes relevant to these tasks are displayed. Classes are different during training and testing, thus we have across-category generalization issues. Some attributes on the left, like "engine", "snout", and "furry", generalize well; some do not. Feature selection helps considerably for those attributes, like "taillight", "cloth", and "rein", that have problems generalizing across classes. Similar to the leave-one-class-out case, learning attributes on the Pascal08 train set and testing them on the Yahoo set involves across-category generalization (right plot). We can, in fact, predict attributes for new classes fairly reliably. Some attributes, like "wing", "door", "headlight", and "taillight", do not generalize well. Feature selection improves generalization on those attributes. Toward the high end of this curve, where good classifiers sit, feature selection improves prediction of attributes with generalization issues and produces similar results for attributes without generalization issues. For better visualization we sorted the plots by the selected features' area under the ROC curve.

benefits of our novel feature selection method compared to using whole features.

6.1. Describing Objects

Assigning attributes: There are two main protocols for attribute prediction: "within category" predictions, where train and test instances are drawn from the same set of classes, and "across category" predictions, where train and test instances are drawn from different sets of classes. We do across-category experiments using a leave-one-class-out approach, or a new set of classes on a new dataset. We train attributes on a-Pascal and test them on a-Yahoo. We measure our performance in attribute prediction by the area under the ROC curve, mainly because it is invariant to class priors. We can predict attributes for the within-category protocol with an area under the curve of 0.834 (Figure 3).

Figure 4 shows that we can predict attributes fairly reliably for across-category protocols. The plot on the left shows the leave-one-class-out case on a-Pascal and the plot on the right shows the same curve for the a-Yahoo set.

Figure 5 depicts 12 typical images from the a-Yahoo set with a subset of positively predicted attributes. These attribute classifiers are learned on the a-Pascal train set and tested on a-Yahoo images. Attributes written in red, with red crosses, are wrong predictions.

Unusual attributes: People tend to make statements about unexpected aspects of known objects ([11], p. 101). An advantage of an attribute-based representation is that we can easily reproduce this behavior. The ground-truth attributes specify which attributes are typical for each class. If a reliable attribute classifier predicts that one of these typical attributes is absent, we report that it is not visible in the image. Figure 6 shows some of these typical attributes which are not visible in the image. For example, it is worth reporting when we do not see the "wing" an aeroplane is expected to have. To qualitatively evaluate this task we reported 752 expected attributes over the whole dataset which are not visible in the images. 68.2% of these reports are correct when compared to our manual labeling of those reports (Figure 6). On the other hand, if a reliable attribute classifier predicts an attribute which is not expected in the predicted class, we can report that, too (Figure 7). For example, birds don't have a "leaf", and if we see one we should report it. To quantitatively evaluate this prediction we evaluate 951 of those predictions by hand; 47.3% are correct. There are two important consequences. First, because birds never have leaves, we may be able to exploit knowledge of object semantics to reason that, in this case, the bird is in a tree. Second, because we can localize features used to predict attributes, we can show what caused the unexpected attribute to be predicted (Figure 8). For example, we can sometimes tell where the "metal" is in a pic-

Figure 5: Randomly selected positively predicted attributes for 12 typical images from 12 categories in the Yahoo set. Attribute classifiers are learned on the Pascal train set and tested on the Yahoo set. We randomly select 5 predicted attributes from the list of 64 attributes available in the dataset. Bounding boxes around the objects are provided by the dataset and we only look inside the bounding boxes to predict attributes. Wrong predictions are written in red and marked with red crosses.

Figure 6: Reporting the absence of typical attributes. For example, we expect to see a "wing" in an aeroplane. It is worth reporting if we see a picture of an aeroplane for which the wing is not visible, or a picture of a bird for which the tail is not visible.

Figure 7: Reporting the presence of atypical attributes. For example, we don't expect to observe "skin" on a dining table. Notice that, if we have access to information about object semantics, observing "leaf" in an image of a bird might eventually yield "The bird is in a tree". Sometimes our attribute classifiers are confused by misleading visual similarities, like predicting "horn" from the visually similar handlebar of a road bike.
[Farhadi et al., CVPR’09]
Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer
Christoph H. Lampert, Hannes Nickisch, Stefan Harmeling

Max Planck Institute for Biological Cybernetics, Tübingen, Germany
{firstname.lastname}@tuebingen.mpg.de
Abstract

We study the problem of object classification when training and test classes are disjoint, i.e. no training examples of the target classes are available. This setup has hardly been studied in computer vision research, but it is the rule rather than the exception, because the world contains tens of thousands of different object classes and for only a very few of them have image collections been formed and annotated with suitable class labels.

In this paper, we tackle the problem by introducing attribute-based classification. It performs object detection based on a human-specified high-level description of the target objects instead of training images. The description consists of arbitrary semantic attributes, like shape, color or even geographic information. Because such properties transcend the specific learning task at hand, they can be pre-learned, e.g. from image datasets unrelated to the current task. Afterwards, new classes can be detected based on their attribute representation, without the need for a new training phase. In order to evaluate our method and to facilitate research in this area, we have assembled a new large-scale dataset, "Animals with Attributes", of over 30,000 animal images that match the 50 classes in Osherson's classic table of how strongly humans associate 85 semantic attributes with animal classes. Our experiments show that by using an attribute layer it is indeed possible to build a learning object detection system that does not require any training images of the target classes.
1. Introduction

Learning-based methods for recognizing objects in natural images have made large progress over the last years. For specific object classes, in particular faces and vehicles, reliable and efficient detectors are available, based on the combination of powerful low-level features, e.g. SIFT or HoG, with modern machine learning techniques, e.g. boosting or support vector machines. However, in order to achieve good classification accuracy, these systems require a lot of manually labeled training data, typically hundreds or thousands of example images for each class to be learned.

It has been estimated that humans distinguish between at least 30,000 relevant object classes [3]. Training conventional object detectors for all these would require millions of well-labeled training images and is likely out of reach for years to come. Therefore, numerous techniques for reducing the number of necessary training images have been developed, some of which we will discuss in Section 3. However, all of these techniques still require at least some labeled training examples to detect future object instances.

Human learning is different: although humans can learn and abstract well from examples, they are also capable of detecting completely unseen classes when provided with a high-level description. E.g., from the phrase "eight-sided red traffic sign with white writing" we will be able to detect stop signs, and when looking for "large gray animals with long trunks" we will reliably identify elephants. We build on this paradigm and propose a system that is able to detect objects from a list of high-level attributes. The attributes serve as an intermediate layer in a classifier cascade and they enable the system to detect object classes for which it had not seen a single training example.

Figure 1. A description by high-level attributes allows the transfer of knowledge between object categories: after learning the visual appearance of attributes from any classes with training examples, we can detect also object classes that do not have any training images, based on which attribute description a test image fits best. (Example attribute vectors in the figure: otter: black yes, white no, brown yes, stripes no, water yes, eats fish yes. Polar bear: black no, white yes, brown no, stripes no, water yes, eats fish yes. Zebra: black yes, white yes, brown no, stripes yes, water no, eats fish no.)
Clearly, a large number of possible attributes exist and collecting separate training material to learn an ordinary classifier for each of them would be as tedious as for all object classes. But, instead of creating a separate training
[Lampert et al., CVPR’09]
• requires hand-specified attribute-class associations
• attribute classifiers must be trained with human-labeled examples
Method overview

1. Classeme learning: train one classifier per basis class, e.g. φ_"body of water"(x), φ_"walking"(x), ...

2. Using the classemes for recognition and retrieval: compute Φ(x_1), ..., Φ(x_N) for the training examples of the novel class and learn

$$g_{\mathrm{duck}}(\Phi(x)) = \sum_{c=1}^{C} w^{\mathrm{duck}}_c\,\phi_c(x)$$
Classeme learning: choosing the basis classes

• Classeme label desiderata:
  - must be visual concepts
  - should span the entire space of visual classes
• Our selection: concepts defined in the Large-Scale Concept Ontology for Multimedia [LSCOM] to be "useful, observable and feasible for automatic detection".
• 2659 classeme labels, after manual elimination of plurals, near-duplicates, and inappropriate concepts.
Classeme learning: gathering the training data

• We downloaded the top 150 images returned by Bing Images for each classeme label.
• For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier (e.g. φ_"walking"(x): yes/no).
Classeme learning: training the classifiers

• Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009], a linear combination of feature-specific SVMs:

$$\phi(x) = \sum_{f=1}^{F} \beta_f \Big( \sum_{n=1}^{N} k_f(x, x_n)\,\alpha_{f,n} + b_f \Big)$$

• We use 13 kernels based on spatial pyramid histograms computed from the following features:
  - color GIST [Oliva and Torralba, 2001]
  - oriented gradients [Dalal and Triggs, 2005]
  - self-similarity descriptors [Shechtman and Irani, 2007]
  - SIFT [Lowe, 2004]
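The slides do not pin down the kernel function applied to these histograms; one common choice on spatial-pyramid histograms is the histogram-intersection kernel, sketched here:

```python
import numpy as np

def intersection_kernel(H1, H2):
    """Histogram-intersection kernel: k(h, h') = sum_j min(h_j, h'_j).
    Rows of H1 (n x d) and H2 (m x d) are L1-normalized histograms."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)

# Toy: three 4-bin histograms (one image each).
H = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
K = intersection_kernel(H, H)   # identical histograms score 1.0
```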
A dimensionality reduction view of classemes

x (low-level features: GIST, oriented gradients, self-similarity descriptor, SIFT) → Φ → [φ_1(x), ..., φ_2659(x)]

Raw features x:
• non-linear kernels are needed for good classification

Classeme descriptor Φ(x):
• 23K bytes/image
• near state-of-the-art accuracy with linear classifiers
• can be quantized down to <200 bytes/image with almost no recognition loss
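The simplest form of that quantization, one bit per classeme, can be sketched as follows (the descriptor values here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal(2659)     # one classeme descriptor (toy values)

bits = Phi > 0                      # 1-bit quantization: sign of each classeme
packed = np.packbits(bits)          # 8 classemes per byte
```

Plain binarization already yields about 333 bytes for 2659 classemes; the sub-200-byte operating points in the plots imply further compression beyond this.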
Experiment 1: multiclass recognition on Caltech 256

[Plot: accuracy (%) vs. number of training examples (up to 50). Curves: LPbeta (LP-β in [Gehler & Nowozin, 2009] using 39 kernels), LPbeta13 (LP-β with our 13 kernels on x), MKL, Csvm (our approach: linear SVM with classemes Φ(x)), Cq1svm (linear SVM with binarized classemes, i.e. Φ(x) > 0), Xsvm (linear SVM with x).]
Computational cost comparison

[Bar charts comparing LPbeta and Csvm. Training time: LPbeta takes 23 hours, Csvm 9 minutes. Testing time is reported in ms per image.]
Accuracy vs. compactness

[Plot: compactness (images per MB, log scale) vs. accuracy (%). Methods: LPbeta13, Csvm, Cq1svm, Xsvm, nbnn [Boiman et al., 2008], emk [Bo and Sminchisescu, 2008]. Lines link performance at 15 and 30 training examples. Byte-size annotations on the plot: 188 bytes/image, 2.5K bytes/image, 23K bytes/image, 128K bytes/image.]
Experiment 2: object class retrieval

[Fig. 4. Retrieval: percentage of the top 25 in a 6400-document set which match the query class, vs. number of training images. Curves: Csvm, Cq1Rocchio (α=1, β=0), Cq1Rocchio (α=0.75, β=0.15), Bowsvm, BowRocchio (α=1, β=0), BowRocchio (α=0.75, β=0.15). Random performance is 0.4%.]
We consider two different retrieval methods. The first method is a linear SVM learned for each of the Caltech classes using the one-vs-all strategy. We compare these classifiers to the Rocchio algorithm [15], which is a classic information retrieval technique for implementing relevance feedback. In order to use this method we represent each image as a document vector d(x). In the case of the BOW model, d(x) is the traditional tf-idf-weighted histogram of words. In the case of classemes instead, we define $d(x)_i = [\phi_i(x) > 0] \cdot \mathrm{idf}_i$, i.e. d(x) is computed by multiplying the binarized classemes by their inverted document frequencies. Given a set of relevant training images $D_r$, and a set of non-relevant examples $D_{nr}$, Rocchio's algorithm computes the document query

$$q = \alpha \frac{1}{|D_r|} \sum_{x_r \in D_r} d(x_r) \;-\; \beta \frac{1}{|D_{nr}|} \sum_{x_{nr} \in D_{nr}} d(x_{nr}) \qquad (1)$$

where α and β are scalar values. The algorithm then retrieves the database documents having highest cosine similarity with this query. In our experiment, we set $D_r$ to be the training examples of the class to retrieve, and $D_{nr}$ to be the remaining training images. We report results for two different settings: (α, β) = (0.75, 0.15), and (α, β) = (1, 0), corresponding to the case where only positive feedback is used.

Figure 4 shows that methods using classemes consistently outperform the algorithms based on traditional BOW features. Furthermore, SVM yields much better precision than Rocchio's algorithm when using classemes. Note that these linear classifiers can be evaluated very efficiently even on large data sets; furthermore, they can also be trained efficiently and thus used in applications requiring fast query-time learning: for example, the average time required to learn a one-vs-all SVM using classemes is 674 ms when using 5 training examples from each Caltech class.
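Equation (1) and the cosine-similarity ranking can be sketched directly (the document vectors and function names below are toy stand-ins):

```python
import numpy as np

def rocchio_query(D_rel, D_nonrel, alpha=0.75, beta=0.15):
    """q = alpha * mean of relevant vectors - beta * mean of non-relevant ones."""
    return alpha * D_rel.mean(axis=0) - beta * D_nonrel.mean(axis=0)

def rank_by_cosine(q, D):
    """Database indices sorted by decreasing cosine similarity to the query."""
    sims = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)

# Toy binarized-classeme-style document vectors.
D_rel = np.array([[1.0, 1.0, 0.0, 0.0],
                  [1.0, 0.0, 1.0, 0.0]])
D_nonrel = np.array([[0.0, 0.0, 1.0, 1.0]])
database = np.array([[1.0, 1.0, 1.0, 0.0],   # resembles the relevant docs
                     [0.0, 0.0, 0.0, 1.0]])  # resembles the non-relevant one
q = rocchio_query(D_rel, D_nonrel)
order = rank_by_cosine(q, database)
```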
• Random performance is 0.4%
• Training Csvm takes 0.6 sec with 5×256 training examples
Analogies with text retrieval

• Classeme representation of an image: presence/absence of visual attributes
• Bag-of-words representation of a text document: presence/absence of words
Related work

• Prior work (e.g., [Sivic & Zisserman, 2003; Nister & Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words: detect interest patches, compute SIFT descriptors [Lowe, 2004], quantize the descriptors against a codebook, and represent the image as a sparse histogram of visual-word frequencies.
• To extend this methodology to object-class retrieval we need:
  - a representation more suited to object-class recognition (e.g. classemes as opposed to bags of visual words)
  - to train the ranking/retrieval function for every new query class
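The quantization step of the bag-of-visual-words pipeline, sketched with a random toy codebook in place of a k-means vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((5, 128))      # 5 "visual words" (toy; k-means in practice)
descriptors = rng.standard_normal((40, 128))  # SIFT-like descriptors from one image

# Assign each descriptor to its nearest codeword, then histogram the counts.
d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
words = d2.argmin(axis=1)
hist = np.bincount(words, minlength=len(codebook)).astype(float)
hist /= hist.sum()                            # the image's bag-of-words vector
```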
Data structures for efficient retrieval

Incidence matrix (rows: documents, columns: features f0 ... f7):

I0: 1 0 1 0 0 1 0 0
I1: 0 0 1 0 1 0 0 0
I2: 1 1 0 1 0 0 0 0
I3: 1 0 1 1 0 0 0 0
I4: 1 0 0 0 1 0 1 0
I5: 0 0 0 0 1 0 1 0
I6: 1 0 0 0 0 1 0 1
I7: 0 1 0 0 1 0 0 0
I8: 1 1 0 0 0 1 0 0
I9: 0 0 0 1 1 1 0 1

Inverted index (feature → documents containing it):

f0: I0 I2 I3 I4 I6 I8
f1: I2 I7 I8
f2: I0 I1 I3
f3: I2 I3 I9
f4: I1 I4 I5 I7 I9
f5: I0 I6 I8 I9
f6: I4 I5
f7: I6 I9

• very compact: only one bit per feature entry
• enables efficient calculation of $w^T \Phi$ for all Φ, as $\sum_{i \,:\, \Phi_i \neq 0} w_i \Phi_i$
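A sketch of inverted-index scoring on this example: only the inverted lists of features with non-zero weight are touched, and the result matches the dense inner products.

```python
import numpy as np
from collections import defaultdict

# Incidence matrix from the slide: rows are documents I0..I9, columns f0..f7.
docs = np.array([[1,0,1,0,0,1,0,0], [0,0,1,0,1,0,0,0], [1,1,0,1,0,0,0,0],
                 [1,0,1,1,0,0,0,0], [1,0,0,0,1,0,1,0], [0,0,0,0,1,0,1,0],
                 [1,0,0,0,0,1,0,1], [0,1,0,0,1,0,0,0], [1,1,0,0,0,1,0,0],
                 [0,0,0,1,1,1,0,1]], dtype=float)

# Build the inverted index: feature id -> ids of documents with that bit set.
index = defaultdict(list)
for i, d in enumerate(docs):
    for f in np.flatnonzero(d):
        index[f].append(i)

def score_all(w, index, n_docs):
    """Accumulate w^T Phi for every document, walking only the
    inverted lists of features whose weight is non-zero."""
    scores = np.zeros(n_docs)
    for f, wf in enumerate(w):
        if wf != 0.0:
            for i in index[f]:
                scores[i] += wf
    return scores

w = np.array([1.5, -2.0, 0.0, -5.0, 0.0, 3.0, -2.0, 0.0])
scores = score_all(w, index, len(docs))
```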
Efficient retrieval via inverted index

Goal: compute the score $w^T \Phi$ for every binary vector Φ in the database.

w: [1.5  -2  0  -5  0  3  -2  0]

Scoring walks the inverted lists of the features with non-zero weight (here f0, f1, f3, f5, f6) and accumulates each document's score I0 ... I9.
Cost of scoring is linear in the sum of the lengths of the inverted lists associated with non-zero weights.
Improve efficiency via sparse weight vectors

Key idea: force w to contain as many zeros as possible.

Learning objective:

$$E(w) = R(w) + \frac{C}{N} \sum_{n=1}^{N} L(w; \Phi_n, y_n)$$

where R is a regularizer, L a loss function, $\Phi_n$ the classeme vector of example n, and $y_n$ its label.

• L2-SVM: $R(w) = w^T w$, $\quad L(w; \Phi_n, y_n) = \max(0,\, 1 - y_n(w^T \Phi_n))$
• Since $|w_i| > w_i^2$ for small $w_i$ and $|w_i| < w_i^2$ for large $w_i$, choosing $R(w) = \sum_i |w_i|$ will tend to produce a small number of larger weights and more zero weights.
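The intuition can be made concrete with the one-dimensional proximal problems for the two regularizers (a standard calculation, not from the slides): the ℓ1 penalty zeroes small coefficients exactly, while the squared penalty only shrinks them.

```python
import numpy as np

def prox_l1(z, lam):
    """argmin_w 0.5*(w - z)^2 + lam*|w|  (soft-thresholding):
    coefficients smaller than lam in magnitude become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_l2(z, lam):
    """argmin_w 0.5*(w - z)^2 + lam*w^2  (shrinkage):
    every coefficient shrinks, none reaches exactly zero."""
    return z / (1.0 + 2.0 * lam)

z = np.array([3.0, 0.4, -0.2, 1.5, -0.05])
w_l1 = prox_l1(z, lam=0.5)   # sparse
w_l2 = prox_l2(z, lam=0.5)   # dense
```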
[Figure reproduced from "Tomographic inversion with ℓ1 wavelet penalization" (its Figure 1): sparsity, ℓ1 minimization and ℓ2 minimization. Left: because the ℓ1-ball $|w_1| + |w_2| = \text{const}$ has no bulge, the solution of $d = AW^T w$ with smallest ℓ1-norm is sparser than the solution with smallest ℓ2-norm (ball $w_1^2 + w_2^2 = \text{const}$). Right: a $|w|$ penalization affects small coefficients more and large coefficients less than the (traditional) $w^2$ penalization.]
where c(n) is independent of w. This functional has a much simpler form than the original I1(w) because there is no operator
AWT mixing di!erent components of w. The next approximation w
(n+1) is defined by the minimizer of this new functional.
By calculating the derivative of expression (3) with respect to a specific wavelet or scaling coe"cient wi, one finds the following
set of component-by-component equations:
wi !!
WATd + (I ! WA
TAW
T )w(n)"
i+ " sign(wi) = 0, (4)
valid whenever wi "= 0. These equations are solved by distinguishing the two cases wi > 0 and wi < 0; the solution —
corresponding to the minimizer of the surrogate functional I(n)1 (w), and denoted by w
(n+1) — is then found to equal
w(n+1) = S!
#
WATd + (I ! WA
TAW
T )w(n)$
, (5)
where S! is the so-called soft-thresholding operation, i.e. (see Fig. 2, right side)
S! (w) =
%
&
'
w ! " w # "
0 |w| $ "
w + " w $ !",
(6)
performed on each wavelet or scaling coe"cient wi individually. The starting point of the iteration procedure is arbitrary,
e.g. w(0) = 0. Because of the component-wise character of the tresholding, it is straightforward to use di!erent thresholds
"i for di!erent components wi if desired, and in fact we shall use di!erent thresholds "w and "s for the wavelet and scaling
coe"cients in our application. A schematic representation of the idea behind the iteration (5) is given in Fig. 2. We realize
that this iteration converges slowly for ill-conditioned matrices, but we use it here because it is proven to converge to the
solution (Daubechies et al. 2004).
An improvement in convergence can be gained by rescaling the operator A (and rescaling the data d at the same time)
in such a way that the largest eigenvalue of #2A
TA is close to (but smaller than) unity. The iteration corresponding to the
minimization of this new functional is
w(n+1) = S!"2
#
#2WA
Td + (I ! #2
WATAW
T )w(n)$
. (7)
We will also make use of the following two-step procedure: from the outcome m = Wᵀw of the iteration (7), we define new data d′ = 2d − Am and restart the same iteration with this new data:

    w⁽ⁿ⁺¹⁾ = S_{τγ²} [ γ² W Aᵀ d′ + (I − γ² W Aᵀ A Wᵀ) w⁽ⁿ⁾ ],    w⁽⁰⁾ = w.    (8)

The outcome m′ = Wᵀw of this second iteration is then the final, regularized reconstruction of the model. For the same value of τγ², the second step improves the data fit considerably, ‖d − Am′‖₂ < ‖d − Am‖₂; hence a given level of final data fit χ² will, in the two-step procedure, correspond to a higher value of τγ². Because τγ² determines the threshold level, a higher value will lead to more aggressive thresholding and thus faster convergence to a sparse solution.
The above method will be demonstrated in the next section and compared to a conventional ℓ₂-regularization method, in which the functional

    I₂(m) = ‖d − Am‖₂² + τ‖m‖₂²    (9)
Improve efficiency via sparse weight vectors

Key idea: force w to contain as many zeros as possible.

Learning objective:
    E(w) = R(w) + (C/N) Σₙ₌₁ᴺ L(w; Φn, yn)
    (R: regularizer; L: loss function; Φn: classeme vector of example n; yn: label of example n)

• L2-SVM: R(w) = wᵀw, L(w; Φn, yn) = max(0, 1 − yn wᵀΦn)
• L1-LR: R(w) = Σᵢ |wᵢ|, L(w; Φn, yn) = log(1 + exp(−yn wᵀΦn))
• FGM (Feature Generating Machine) [Tan et al., 2010]:
  R(w) = wᵀw, L(w; Φn, yn) = max(0, 1 − yn (w ⊙ d)ᵀΦn)
  s.t. 1ᵀd ≤ B, d ∈ {0, 1}ᴰ (⊙: elementwise product)
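A quick way to see how the L1 regularizer above produces zeros is proximal gradient descent on the L1-LR objective: a plain gradient step on the logistic loss followed by soft-thresholding. This is a generic sketch on synthetic data, not the training code used in the talk; the data, step size, and C value are illustrative:

```python
import numpy as np

def train_l1_lr(Phi, y, C=1.0, lr=0.1, n_iter=2000):
    """Minimize E(w) = sum_i |w_i| + (C/N) sum_n log(1 + exp(-y_n w^T Phi_n))
    by gradient steps on the loss followed by soft-thresholding
    (the proximal operator of lr * ||w||_1)."""
    N, D = Phi.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        margins = np.clip(y * (Phi @ w), -30, 30)   # clip to avoid exp overflow
        # gradient of the averaged logistic loss w.r.t. w
        grad = (C / N) * (Phi.T @ (-y / (1.0 + np.exp(margins))))
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr, 0.0)  # shrink toward zero
    return w

# Illustrative data: only the first 3 of 50 features carry the label
rng = np.random.default_rng(1)
Phi = rng.standard_normal((200, 50))
y = np.sign(Phi[:, :3].sum(axis=1))
w = train_l1_lr(Phi, y, C=5.0)
print("nonzeros:", np.count_nonzero(w), "of", w.size)
```

The uninformative features are driven exactly to zero by the shrinkage step, which is what makes the inverted-index and pruning strategies below effective.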
Performance evaluation on ImageNet (10M images)

[Figure: Precision@10 (%) vs. search time per query (seconds), comparing full inner product evaluation and inverted index, each with L2-SVM and L1-LR]

• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%
Each curve is obtained by varying sparsity through C in the training objective
    E(w) = R(w) + (C/N) Σₙ₌₁ᴺ L(w; Φn, yn)
    (R: regularizer; L: loss function)

[Figure repeated: Precision@10 (%) vs. search time per query (seconds)]

[Rastegari et al., 2011]
Top-k ranking

• Do we need to rank the entire database?
  - users only care about the top-ranked images
• Key idea:
  - for each image, iteratively update an upper bound and a lower bound on the score
  - gradually prune images that cannot rank in the top-k
Top-k pruning [Rastegari et al., 2011]

w: [ 3 -2 0 -6 0 3 -2 0 ]

        f0 f1 f2 f3 f4 f5 f6 f7
    I0:  1  0  1  0  0  1  0  0
    I1:  0  0  1  0  1  0  0  0
    I2:  1  1  0  1  0  0  0  0
    I3:  1  0  1  1  0  0  0  0
    I4:  1  0  0  0  1  0  1  0
    I5:  0  0  0  0  1  0  1  0
    I6:  1  0  0  0  0  1  0  1
    I7:  0  1  0  0  1  0  0  0
    I8:  1  1  0  0  0  1  0  0
    I9:  0  0  0  1  1  1  0  1

• Initial upper bound — highest possible score: u* = wᵀΦᵁ for the binary vector Φᵁ s.t. Φᵁᵢ = 1 iff wᵢ > 0 (6 in this case)
• Initial lower bound — lowest possible score: l* = wᵀΦᴸ for the binary vector Φᴸ s.t. Φᴸᵢ = 1 iff wᵢ < 0 (-10 in this case)
Top-k pruning [Rastegari et al., 2011] (w and incidence matrix as above)

• Initialization: bounds u*, l* for all images I0 ... I9
• Load feature f0: since w0 = +3 (> 0), for each image n:
  - subtract 3 from the upper bound if φn,0 = 0
  - add 3 to the lower bound if φn,0 = 1
• Load feature f1: since w1 = -2 (< 0), for each image n:
  - decrement the upper bound by 2 if φn,1 = 1
  - increment the lower bound by 2 if φn,1 = 0
• Load feature f3: since w3 = -6 (< 0), update the bounds by 6 in the same way
• Suppose k = 4: we can already prune I2 and I9, since their upper bounds fall below the k-th largest lower bound and they cannot rank in the top-k
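The pruning loop can be sketched as follows — a simplified reimplementation of the idea, not the authors' code; the feature visiting order and tie handling are illustrative choices:

```python
import numpy as np

def top_k_pruning(w, X, k):
    """Rank binary vectors X (n_images x D) by score w^T x, maintaining per-image
    score bounds and pruning images whose upper bound falls below the k-th
    largest lower bound."""
    n, D = X.shape
    upper = np.full(n, w[w > 0].sum())   # best case: all positive-weight bits on
    lower = np.full(n, w[w < 0].sum())   # worst case: all negative-weight bits on
    alive = np.ones(n, dtype=bool)
    for i in np.argsort(-np.abs(w)):     # visit features by decreasing |w_i|
        if w[i] == 0:
            break                        # remaining features cannot change scores
        on = X[:, i] == 1
        if w[i] > 0:
            upper[~on] -= w[i]           # missing a positive feature lowers the bound
            lower[on] += w[i]            # having it raises the lower bound
        else:
            upper[on] += w[i]            # having a negative feature lowers the bound
            lower[~on] -= w[i]           # missing it raises the lower bound
        thresh = np.sort(lower[alive])[-k]   # k-th largest lower bound
        alive &= upper >= thresh             # prune hopeless images
    scores = X[alive] @ w
    idx = np.flatnonzero(alive)
    return idx[np.argsort(-scores)][:k]

# The 10-image example from the slides
w = np.array([3, -2, 0, -6, 0, 3, -2, 0], dtype=float)
X = np.array([[1,0,1,0,0,1,0,0],[0,0,1,0,1,0,0,0],[1,1,0,1,0,0,0,0],
              [1,0,1,1,0,0,0,0],[1,0,0,0,1,0,1,0],[0,0,0,0,1,0,1,0],
              [1,0,0,0,0,1,0,1],[0,1,0,0,1,0,0,0],[1,1,0,0,0,1,0,0],
              [0,0,0,1,1,1,0,1]])
print(top_k_pruning(w, X, k=4))
```

Images in the true top-k are never pruned: their exact score is at least the k-th best lower bound at every step, so their upper bound always clears the threshold.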
Distribution of weights and pruning rate
ICCV 2011 Submission #1745. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.
[Figure 2 plots]

Figure 2. (a) Distribution of weight absolute values for different classifiers (after sorting the weight magnitudes). TkP runs faster with sparse, highly skewed weight values. (b) Pruning rate of TkP for various classification models and different values of k (k = 10, 3000).
a smaller value of k allows the method to eliminate more images from consideration at a very early stage.

We now turn to study the effect of the parameters D′, v, w on the efficiency and accuracy of AR. Figure 3 shows retrieval speed and precision obtained by varying v and w for D′ ∈ {128, 256, 512}. Increasing the dictionary size (w) reduces the quantization error while raising the quantization time: note the slightly better accuracy but higher search time when we move from parameter setting (D′ = 512, v = 256, w = 2⁶) to (D′ = 512, v = 256, w = 2⁸). The number of sub-blocks (v) critically affects the retrieval time: reducing v lowers the search time considerably but causes a drop in accuracy. Finally, note how D′ impacts the accuracy, since it affects both the number of parameters in the classifier as well as the projection error: using a large D′ is beneficial for accuracy when v and w are large; however, when there are few cluster centroids or the number of sub-blocks is small, lowering D′ improves precision since this mitigates the quantization error.

Finally, we also ran an experiment simulating real-world usage of an object-class retrieval system where a user may provide a positive training set but no negative set. In such cases one could use a "background" set for the negative examples. Thus, here we used as negative examples for each query n⁻ = 999 randomly chosen images from all 1000 categories, thus possibly containing also some true positives (i.e., images of the query class). As expected, we found the precisions of the L1-LR and L2-SVM classifiers to be nearly unchanged by the few incorrectly labeled examples: precisions at 10 in this case are 18.75% and 22.55%, respectively.
[Figure 3 plot: Precision@10 (%) vs. search time per query (seconds) for D′ = 512, 256, 128 and various (v, w) settings]

Figure 3. Effects of parameters D′, v, w on the accuracy and search time of AR for the ILSVRC2010 data set. A small v implies faster retrieval at the expense of accuracy. Using a larger value for w reduces the quantization error at a small increase in search time. Lowering D′ decreases the power of the classifier (VC-dimension) and increases the PCA projection error, thus negatively impacting precision.
Retrieval results on ImageNet (10M images). We now present results on the 10-million ImageNet dataset [4], which encompasses over 15,000 categories (in our experiment we used 15203 classes). We used a subset of 950 categories as query classes. For each of these classes we capped the number of true positives in the database to be n⁺test = 450. The total number of distractors for each query is n⁻test = 9,671,611. We trained classifiers for each query category using a training set consisting of n⁺ = 10 positive examples.
Features considered in descending order of |wi |
[Figure 2(a) repeated: normalized absolute weight values, features in descending order of |wᵢ|]
Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]

[Figure: Precision@10 (%) vs. search time per query (seconds) for TkP L1-LR, TkP L2-SVM, inverted index L1-LR, inverted index L2-SVM]

• k = 10
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%

Each curve is obtained by varying sparsity through C in the training objective
    E(w) = R(w) + (C/N) Σₙ₌₁ᴺ L(w; Φn, yn)
    (R: regularizer; L: loss function)
Alternative search strategy: approximate ranking

• Key idea: approximate the score function with a measure that can be computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008])
• Approximate ranking via vector quantization:
    wᵀΦ ≈ wᵀq(Φ)
  where q(.) is a quantizer returning the cluster centroid nearest to Φ
• Problem:
  - to approximate the score well we need a fine quantization
  - the dimensionality of our space is D = 2659: too large to enable a fine quantization using k-means clustering
Product quantization [Jegou et al., 2011]

• Split the feature vector Φ into v subvectors: Φ ≈ [ Φ1 | Φ2 | ... | Φv ]
• Subvectors are quantized separately by quantizers:
    q(Φ) = [ q1(Φ1) | q2(Φ2) | ... | qv(Φv) ]
  where each qi(.) is learned in a space of dimensionality D/v
Example from [Jegou et al., 2011]: y is a 128-dimensional vector split into 8 subvectors of 16 components each:

    y = [ y1 | y2 | ... | y8 ]
    q(y) = [ q1(y1) | q2(y2) | ... | q8(y8) ]

Each qj is learned by k-means with a limited number of centroids: 2⁸ = 256 centroids, i.e. 8 bits per subvector ⇒ 64-bit quantization index.
Efficient approximate scoring

    wᵀΦ ≈ wᵀq(Φ) = Σⱼ₌₁ᵛ wⱼᵀ qⱼ(Φj)

The partial inner products wⱼᵀ qⱼ(Φj) can be precomputed and stored in a look-up table:

1. Filling the look-up table: for each sub-block j (rows) and each of the r centroids per sub-block (columns), store the inner product sⱼc between wⱼ and the c-th centroid of quantizer qⱼ:

    s11 s12 s13 ... s1r
    s21 s22 s23 ... s2r
    ...
    sv1 sv2 sv3 ... svr

2. Score each quantized vector q(Φ) in the database using the look-up table:

    wᵀq(Φ) = w1ᵀq1(Φ1) + w2ᵀq2(Φ2) + ... + wvᵀqv(Φv)

Only v additions per image!
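A minimal numpy sketch of this two-step scoring; the dimensions are toy values and the "codebooks" are random stand-ins for k-means output:

```python
import numpy as np

rng = np.random.default_rng(0)
D, v = 64, 8          # feature dimension, number of sub-blocks
d = D // v            # sub-block dimensionality
r = 16                # centroids per sub-block (toy value; 2^8 in the talk)

# Toy codebooks: r centroids per sub-block (stand-ins for learned k-means centroids)
codebooks = rng.standard_normal((v, r, d))
w = rng.standard_normal(D)

# 1. Fill the v x r look-up table of partial inner products s_jc = w_j^T c_jc
table = np.einsum('jd,jrd->jr', w.reshape(v, d), codebooks)

def quantize(phi):
    """Return, per sub-block, the index of the nearest centroid."""
    parts = phi.reshape(v, d)
    dists = ((codebooks - parts[:, None, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)          # v indices, each in [0, r)

def approx_score(codes):
    """2. Score via v table look-ups and additions."""
    return table[np.arange(v), codes].sum()

phi = rng.standard_normal(D)
codes = quantize(phi)
# The table-based score equals w^T q(phi) computed explicitly
q_phi = np.concatenate([codebooks[j, codes[j]] for j in range(v)])
print(np.isclose(approx_score(codes), w @ q_phi))
```

In a real system the codes for all database images are precomputed once, so a query costs one table fill plus v additions per image.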
Choice of parameters [Rastegari et al., 2011]

• Dimensionality is first reduced with PCA from D = 2659 to D′ < D
• How do we choose D′, v (number of sub-blocks), r (number of centroids per sub-block)?
• Effect of parameter choices on a database of 150K images:

[Figure: Precision@10 (%) vs. search time per query (seconds) for D′ = 512, 256, 128, with points labeled by (v, r), e.g. (16, 2⁸), (32, 2⁸), (64, 2⁶), (128, 2⁸), (256, 2⁶), (256, 2⁸)]
[Figure 1 plot: Precision@10 (%) vs. search time per query (seconds) for AR L2-SVM, TkP L1-LR, TkP L2-SVM, TkP FGM]

Figure 1. Class-retrieval precision versus search time for the ILSVRC2010 data set: x-axis is search time; y-axis shows percentage of true positives ranked in the top 10 using a database of 150,000 images (with n⁻test = 149,850 distractors and n⁺test = 150 true positives for each query class). The curve for each method is obtained by varying parameters controlling the accuracy-speed tradeoff (see details in the text).
subdivide a vector according to the order of components so that the j-th sub-block would consist of the consecutive feature entries from position (1 + (j − 1)D′/v) to (jD′/v). However, such a strategy would blindly allocate the same number of centroids for the most informative components (the ones in the first sub-block) as well as for the least informative. We address this problem using the solution proposed in [13]: we apply a random orthogonal transformation after PCA so that the variances of the resulting components will be more even. We then quantize the examples and train our retrieval models in this space.

6. Experiments

In this section we empirically evaluate the proposed algorithms and the several possible parameter options on challenging data sets under the performance measures of retrieval accuracy, speed and memory usage. We denote the top-k pruning method with TkP and the approximate ranking technique with AR.

Retrieval evaluation on ILSVRC2010 (150K images). We first evaluate our methods using the data set of the Large Scale Visual Recognition Challenge 2010 (ILSVRC2010) [1], which includes images of 1000 different categories. We use a subset of the ILSVRC2010 training set to learn the classifiers: for each of the 1000 classes, we train a classifier using n⁺ = 50 positive examples (i.e., images belonging to the query category) and n⁻ = 999 negative examples obtained by sampling one image from each of the other classes. To cope with the largely unequal number of positive and negative examples (n⁻ ≫ n⁺) we normalize the loss term for each example in eq. 1 by the size of its class. We evaluate the learned retrieval models on the ILSVRC2010 test set, which includes 150,000 images, with 150 examples per category. Thus, the database contains n⁺test = 150 true positives and n⁻test = 149,850 distractors for each query. Figure 1 shows precision versus search time for AR and TkP in combination with different classification models. Since AR does not use sparsity to achieve efficiency, we only paired it with the L2-SVM model. The x-axis shows average retrieval time per query, measured on a single-core computer with 16GB of RAM and an Intel Core i7-930 CPU @ 2.80GHz. The y-axis reports precision at 10, which measures the proportion of true positives in the top 10. The times reported for TkP were obtained using k = 10. The curve for AR was generated by varying the parameter choices for v and w, as discussed in further detail later. The performance curves for "TkP L1-LR" and "TkP L2-SVM" were produced by varying the regularization hyperparameter C in eq. 1. While C is traditionally viewed as controlling the bias-variance tradeoff, in our context it can be interpreted as a parameter balancing generalization accuracy versus sparsity, and thus retrieval speed. In the case of "TkP FGM" we have kept a constant C (tuned by cross-validation), and instead varied the sparsity of this classifier by acting on the separate parameter B. From this figure we see that AR is overall the fastest method at the expense of search accuracy: a peak precision of 22.6% is obtained by TkP using L2-SVM, but AR with the same classification model achieves only a top precision of 17.5% due to a combination of fewer learning parameters (in this experiment we used D′ = 512), PCA projection error and quantization error. As expected, we note that TkP runs faster when used in combination with L1-LR or FGM rather than L2-SVM, since it benefits from sparsity in the parameter vectors to eliminate images from consideration. However, we see that sparsity negatively affects accuracy, with L2-SVM providing clearly much better precision compared to L1-LR.

In our experiments we found that TkP typically exhibits faster retrieval in conjunction with L1-LR rather than FGM. We can gain an intuition on the reasons by inspecting the average distribution of weight absolute values in figure 2(a). The average distribution for each classification model was obtained by first sorting the weight absolute values for each query in descending order and then normalizing by the largest absolute value. For this experiment we chose B = 1000 for the FGM model. We can see that although for this setting the weight vectors learned by FGM are on average more sparse than those produced by L1-LR, the normalized magnitude of the L1-LR weights decays much faster. TkP benefits from the presence of these highly skewed weight magnitudes to produce more aggressive pruning.
Performance evaluation on 150K images

• Performance averaged over 1000 object classes used as queries
• 50 training examples per query class
• Database includes 150 images of the query class and 150K images of other classes
• Prec@10 of a random classifier is 0.1%

[Figure annotation: approximate ranking]
Memory requirements for 10M images

• Inverted index: 9 GB
• Incidence matrix (used by TkP): 3 GB
• Product quantization index: 1.8 GB
Conclusions and open questions

• Classemes: a compact descriptor enabling efficient novel-class recognition (less than 200 bytes/image, yet it produces performance similar to MKL at a tiny fraction of the cost)
• Information retrieval approaches to large-scale object-class search:
  - sparse representations and retrieval models
  - top-k ranking
  - approximate scoring
• Questions currently under investigation:
  - can we learn better classemes from fully-labeled data?
  - can we decouple the descriptor size from the number of classeme classes?
  - can we encode spatial information ([Li et al. NIPS10])?
• Software for classeme extraction available at: http://vlg.cs.dartmouth.edu/projects/classemes_extractor/
Outline
1 ICVSS 2011
2 A Trillion Photos - Steven Seitz
3 Efficient Novel Class Recognition and Search - Lorenzo Torresani
4 The Life of Structured Learned Dictionaries - Guillermo Sapiro
5 Image Rearrangement & Video Synopsis - Shmuel Peleg
The Life of Structured Learned Dictionaries
Guillermo Sapiro, University of Minnesota
G. Yu and S. Mallat (Inverse problems via GMM)G. Yu and F. Leger (Matrix completion)G. Yu (Statistical compressed sensing)
A. Castrodad (activity recognition in video)M. Zhou, D. Dunson, and L. Carin (video layers separation)
Friday, July 8, 2011
Examples: Zooming, Inpainting, Deblurring

Inverse problems: y = Uf + w, with w ∼ N(0, σ²Id)
• Inpainting: U is a masking operator
• Zooming: U is a subsampling operator
• Deblurring: U is a convolution operator
Learned Overcomplete Dictionaries

• Dictionary learning:
    min_{D, {aᵢ}} Σ₁≤ᵢ≤I ( ‖fᵢ − Daᵢ‖² + λ‖aᵢ‖₁ )
• Better performance than pre-fixed dictionaries.
• Non-convex.
• High computational complexity.
• Huge numbers of parameters to estimate.
• Behavior not well understood (results starting to appear).
Sparse Inverse Problem Estimation

• Observation: y = Uf + w, where w ∼ N(0, σ²Id)
• Sparse prior: D = {φm}m∈Γ provides a sparse representation for f:
    f = Da + ε, with Λ = support(a), |Λ| ≪ |Γ|, and ‖ε‖₂ ≪ ‖f‖₂
• Hence UD = {Uφm}m∈Γ provides a sparse representation for y:
    y = UDa + ε′, with Λ = support(a), |Λ| ≪ |Γ|, and ‖ε′‖₂ ≪ ‖y‖₂
• Sparse estimation of a from y:
    â = argmin_a ‖UDa − y‖² + λ‖a‖₁
• Inverse problem estimation: f̂ = Dâ
Structured Representation and Estimation

Overcomplete dictionary D vs. structured overcomplete dictionary B1 B2 B3 B4 B5

• Dictionary: union of PCAs, D = {Bk}1≤k≤K
  - Union of orthogonal bases
  - In each basis, the atoms are ordered: λᵏ₁ ≥ λᵏ₂ ≥ ... ≥ λᵏN
• Piecewise linear estimation (PLE)
  - A linear estimator per basis
  - Non-linear basis selection: the best linear estimator is selected
• Small degree of freedom, fast computation, state-of-the-art performance
Gaussian Mixture Models

    yᵢ = Uᵢfᵢ + wᵢ, where wᵢ ∼ N(0, σ²Id)

• Estimate {(µk, Σk)}1≤k≤K from {yᵢ}1≤i≤I
• Identify the Gaussian kᵢ that generates fᵢ, ∀i
• Estimate fᵢ from N(µkᵢ, Σkᵢ), ∀i
Structured Sparsity

• PCA (Principal Component Analysis)
  - PCA basis Bk = {φᵏm}1≤m≤N, orthogonal
  - Σk = Bk Sk Bkᵀ, with Sk = diag(λᵏ₁, ..., λᵏN), λᵏ₁ ≥ λᵏ₂ ≥ ... ≥ λᵏN the eigenvalues
  - PCA transform: fᵏᵢ = Bk aᵏᵢ
• MAP with PCA:
    f̂ᵏᵢ = argmin_{fᵢ} ( ‖Uᵢfᵢ − yᵢ‖² + σ² fᵢᵀ Σk⁻¹ fᵢ )
  ⇔ âᵏᵢ = argmin_{aᵢ} ( ‖UᵢBk aᵢ − yᵢ‖² + σ² Σₘ₌₁ᴺ |aᵢ[m]|² / λᵏm )
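The per-basis MAP estimate above is quadratic in fᵢ, so it has a closed form: f = (UᵀU + σ²Σ⁻¹)⁻¹Uᵀy, a Wiener-type linear filter. A small numpy sketch; the sizes, the random basis/eigenvalues, and the masking operator are illustrative stand-ins:

```python
import numpy as np

def map_estimate(U, y, Sigma, sigma2):
    """Closed-form minimizer of ||U f - y||^2 + sigma2 * f^T Sigma^{-1} f:
    solve (U^T U + sigma2 * Sigma^{-1}) f = U^T y."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.linalg.solve(U.T @ U + sigma2 * Sigma_inv, U.T @ y)

# Illustrative setup: signal from one Gaussian component, observed through masking
rng = np.random.default_rng(0)
N, sigma2 = 16, 0.01
B = np.linalg.qr(rng.standard_normal((N, N)))[0]        # orthogonal "PCA" basis
lam = np.sort(rng.uniform(0.1, 2.0, N))[::-1]           # ordered eigenvalues
Sigma = B @ np.diag(lam) @ B.T                          # Sigma_k = B_k S_k B_k^T
U = np.diag(rng.integers(0, 2, N).astype(float))        # masking (inpainting)
f_true = B @ (np.sqrt(lam) * rng.standard_normal(N))    # sample from N(0, Sigma)
y = U @ f_true + np.sqrt(sigma2) * rng.standard_normal(N)
f_hat = map_estimate(U, y, Sigma, sigma2)
print(f_hat.shape)
```

Because the estimator is linear per basis, PLE only has to evaluate K such filters and pick the best basis, rather than search over all sparse supports.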
Structured Sparsity

Piecewise linear estimate vs. sparse estimate:

    âᵏᵢ = argmin_{aᵢ} ( ‖UᵢBk aᵢ − yᵢ‖² + σ² Σₘ₌₁ᴺ |aᵢ[m]|² / λᵏm )
    vs.
    âᵢ = argmin_{aᵢ} ‖UDaᵢ − yᵢ‖² + λ Σₘ₌₁^{|Γ|} |aᵢ[m]|

• Nonlinear basis selection: degree of freedom K, vs. the full degree of freedom (|Γ| choose |Λ|) in atom selection.
• Linear collaborative filtering in each basis.
Initial Experiments: Evolution

• Clustering, 1st iteration
• Clustering, 2nd iteration
Experiments: Inpainting

Original | 20% available | MCA 24.18 dB [Elad, Starck, Querre, Donoho, 05] | ASR 21.84 dB [Guleryuz, 06] | KR 21.55 dB [Takeda, Farsiu, Milanfar, 06] | FOE 21.92 dB [Roth and Black, 09] | BP 25.54 dB [Zhou, Sapiro, Carin, 10] | PLE 27.65 dB
Experiments: Zooming

Original | Low-resolution | Bicubic 28.47 dB | SR 23.85 dB [Yang, Wright, Huang, Ma, 09] | SAI 30.32 dB [Zhang and Wu, 08] | PLE 30.64 dB
Experiments: Zooming + Deblurring

f | Uf | y = SUf | 29.40 dB | SR 28.93 dB [Yang, Wright, Huang, Ma, 09] | PLE 30.49 dB
Experiments: Denoising

Original | Noisy 22.10 dB | NLmeans 28.42 dB [Buades et al, 06] | FOE 25.62 dB [Roth and Black, 09] | BM3D 30.97 dB [Dabov et al, 07] | PLE 31.00 dB
Summary of this part

• Gaussian mixture models and MAP-EM work well for image inverse problems.
• Piecewise linear estimation, connection to structured sparsity.
• Nonlinear best basis selection, small degree of freedom.
• Collaborative linear filtering.
• Faster computation than sparse estimation.
• Results in the same ballpark as the state of the art.
• Beyond images: recommender systems and audio (Sprechmann & Cancela).
• Statistical compressed sensing.
Modeling and Learning Human Activity

Alexey Castrodad¹,² and Guillermo Sapiro²
¹ NGA Basic and Applied Research
² University of Minnesota, ECE Department
[email protected], [email protected]

Motivation
• Problem: Given volumes of video feed, detect activities of interest
  - Mostly done manually!
• Solving this will:
  - Aid the operator: surveillance/security, gaming, psychological research
  - Sift through large amounts of data
• Solution: Fully/semi-automatic activity detection with minimum human interaction
  - Invariance to spatial transformations
  - Robust to occlusions, low resolution, noise
  - Fast and accurate
  - Simple, generic
Sparse modeling: Dictionary learning from data

Sparse modeling for action classification: Phase 1

• Training (Class 1, Class 2, Class 3): input videos → spatio-temporal feature extraction → sparse modeling → per-class dictionaries D1, D2, D3 forming D
• New video: feature extraction → sparse coding (activations A1, A2, A3) → l1 pooling → classification → classifier output
Sparse modeling for action classification: Phase 2

• Training videos: feature extraction → sparse coding over D = [D1 D2 D3] → l1 pooling → inter-class modeling (dictionaries E1, E2, E3)
• New video: feature extraction → sparse coding (A1, A2, A3) → classifier output from Phase 1 → sparse coding over E1, E2, E3 → classification
Results • YouTube Action Dataset
§ variable spatial resolution videos, 3-8 seconds each
§ 11 types of actions from YouTube videos
Scene: indoors/outdoors
Actions: basketball shooting, cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, walking with a dog
Camera: jitter, scale variations, camera motion, variable illumination conditions, high background clutter
Resolution: variable, resampled to 320 x 240
Frame Rate: 25 fps
Results: YouTube Action Dataset
§ Best/recent reported: 75.8% (Q.V. Le et al., 2011); 84.2% (Wang et al., 2011)
§ Recognition rate: 80.29% (phase 1) and 91.9% (phase 2)
Conclusion
• Main contribution:
§ Robust activity recognition framework based on sparse modeling
§ Generic: works on multiple data sources
§ State-of-the-art results on all of them, with the same parameters
• Key advantages:
§ Simplicity, state-of-the-art results
§ Fast and accurate: 7.5 fps
§ 7 frames needed for detection
• Future directions:
§ Exploit human interactions
§ Infer the actions
§ Foreground extraction/video analysis for activity clustering
Outline
1 ICVSS 2011
2 A Trillion Photos - Steven Seitz
3 Efficient Novel Class Recognition and Search - Lorenzo Torresani
4 The Life of Structured Learned Dictionaries - Guillermo Sapiro
5 Image Rearrangement & Video Synopsis - Shmuel Peleg
Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
Shift-Map Image Editing
Yael Pritch, Eitam Kav-Venaki, Shmuel Peleg
The Hebrew University of Jerusalem
Retargeting (Avidan and Shamir SIGGRAPH'07, Wolf et al. ICCV'07, Wang et al. SIGASIA'08, Rubinstein et al. SIGGRAPH'08, Rubinstein et al. SIGGRAPH'09)
Input
Geometrical Image Editing: Retargeting
Shift-Map Output
Geometrical Image Editing: Inpainting
Inpainting (Criminisi et al. CVPR'03, Wexler et al. CVPR'04, Sun et al. SIGGRAPH'05, Komodakis et al. CVPR'06, Hays and Efros SIGGRAPH'07)
Input Mask
Output Mask
Shift-Map Composition
User Constraints
A B C D
No accurate segmentation required
• A Shift-Map represents, for each pixel in the output image, a mapping into the input image
• The color of each output pixel is copied from its corresponding input pixel
Our Approach: Shift-Map
Output: R(u,v)   Input: I(x,y)
• We use relative mapping coordinates, as in optical flow
Our Approach: Shift-Map
[Figure: an output pixel (u,v) in R(u,v) mapped to its source pixel (x,y) in I(x,y)]
Our Approach: Shift-Map
[Figure: input image, its horizontal and vertical shift maps (e.g. Tx = 0, 50, 400; Ty = 10), and the resulting output image]
• Minimal distortion
• Adaptive boundaries
• Fast optimization
• We look for the optimal mapping; this can be described as an energy minimization problem
• Data term: external editing requirements (computed for each pixel)
• Smoothness term: avoid stitching artifacts (computed for each pair of neighboring pixels)
Geometric Editing as Energy Minimization
• Unified representation for geometric editing applications
• Solved using a graph labeling algorithm
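Written out, the objective over shift-maps M combines the per-pixel data term with the pairwise smoothness term. This is our reconstruction of the formulation in the Shift-Map paper (Pritch et al., ICCV'09); α is a relative weight and 𝒩 the set of neighboring output-pixel pairs:

```latex
E(M) = \sum_{p \in R} E_d\big(M(p)\big)
     \;+\; \alpha \sum_{(p,q) \in \mathcal{N}} E_s\big(M(p), M(q)\big)
```

Each editing application changes only E_d; E_s and the graph-labeling solver stay the same.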
The Smoothness Term
The smoothness term compares both colors and gradients across a discontinuity in the shift-map (Kwatra et al. 03, Agarwala et al. 04).
[Figure: neighboring output pixels p, q in the output image R and their source pixels p', q', with neighbors np', nq', in the input image I]
• The data term varies between different applications
• The inpainting data term uses a data mask D(x,y) over the input image:
– D(x,y) = ∞ for pixels to be removed
– D(x,y) = 0 elsewhere
• Specific input pixels can be forced not to appear in the output image by setting D(x,y) = ∞
The Data Term: Inpainting
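For one candidate shift label, the inpainting data term amounts to a simple lookup: infinite cost when the source pixel is masked or falls outside the input, zero otherwise. A minimal sketch (`inpainting_data_cost` is our own helper, assuming for simplicity that the output has the same size as the input):

```python
import numpy as np

def inpainting_data_cost(mask, shift):
    """Per-pixel data cost for one shift label (sketch).
    mask: boolean array over the input image, True where pixels must be
    removed. shift = (dr, dc) is the candidate relative shift. The cost is
    infinite if the source pixel is masked or out of bounds, zero otherwise."""
    dr, dc = shift
    H, W = mask.shape
    cost = np.full((H, W), np.inf)
    for r in range(H):
        for c in range(W):
            sr, sc = r + dr, c + dc        # source pixel in the input image
            if 0 <= sr < H and 0 <= sc < W and not mask[sr, sc]:
                cost[r, c] = 0.0
    return cost
```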
• Input pixels can be forced to appear in a new location
• The appropriate shift gets infinitely low energy; all other shifts get infinitely high energy
The Data Term: Rearrangement
• Use picture borders
• Can incorporate an importance mask
– An order constraint on the mapping is applied to prevent duplication of important areas
The Data Term: Retargeting
• The minimal-energy mapping can be represented as graph labeling, where the shift-map value is the selected label for each output pixel
• Nodes: output image pixels
• Labels: relative shift-map values (tx, ty), each assigning an output pixel a source in the input image
Shift-Map as Graph Labeling
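To make the labeling concrete, here is a sketch that evaluates the energy of one candidate shift-map: a data cost per output pixel plus a smoothness penalty wherever neighboring pixels carry different shifts. `shiftmap_energy` is our own illustration, with a simplified one-sided color term standing in for the paper's symmetric color-plus-gradient term; a real solver would minimize this energy over labels with graph cuts.

```python
import numpy as np

def sample(I, r, c):
    """Clamp-to-border sampling of image I."""
    H, W = I.shape[:2]
    return I[min(max(r, 0), H - 1), min(max(c, 0), W - 1)]

def shiftmap_energy(I, shifts, data_cost, alpha=1.0):
    """Energy of one candidate shift-map (sketch).
    I: grayscale input image (H, W); shifts: integer array (H, W, 2) giving
    the (dr, dc) label of each output pixel; data_cost(r, c, dr, dc) -> float."""
    H, W = I.shape[:2]
    E = 0.0
    for r in range(H):
        for c in range(W):
            dr, dc = shifts[r, c]
            E += data_cost(r, c, dr, dc)
            # right and down neighbors only, so each pair is counted once
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < H and nc < W:
                    ndr, ndc = shifts[nr, nc]
                    # color neighbor (nr, nc) would take under each of the two shifts
                    a = sample(I, nr + dr, nc + dc)
                    b = sample(I, nr + ndr, nc + ndc)
                    E += alpha * float(a - b) ** 2
    return E
```

A constant shift-map incurs no smoothness cost, which is why smooth regions of the output tend to share a single shift.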
Hierarchical Solution
• Gaussian pyramid on the input
[Figure: coarse-level shift-map refined into a finer shift-map, producing the output]
Shift-Map handles, without additional user interaction, some cases that previous algorithms could only handle with additional user guidance.
Results and Comparison
J. Sun, L. Yuan, J. Jia, and H. Shum. Image completion with structure propagation. In SIGGRAPH’05
Shift-Map
Image completion with structure propagation [Sun et al. SIGGRAPH’05]
Mask
Application: Retargeting
Input Output
Non-Homogeneous [Wolf et al., ICCV'07]
PatchMatch [Barnes et al., SIGGRAPH'09]
Improved Seam Carving [Rubinstein et al., SIGGRAPH'08]
Shift-Maps
Results and Comparison
Summary
• New representation of geometric editing applications as optimal graph labeling
• Unified approach
• Solved efficiently using hierarchical approximations
• Minimal user interaction is required for various editing tasks
• Build an output image R from pixels of a source image I, such that R is most similar to a target image T
Similarity Guided Composition
Source Image / Target Image / Output
• Data term reflects a similarity between the output image R and a target image T
• Similarity uses both colors and gradients
Similarity Guided Composition
• Data term indicates the similarity of the output image to the target image
• The weight between similarity and smoothness controls the trade-off between matching the target and avoiding stitching artifacts
Source Image / Target Image / Resulting Output
Previous Work: Efros and Freeman 2001, Hertzmann et al. 2001
Similarity Guided Composition
Edge Preserving Magnification
• Using the original image as the source, similarity-guided composition can magnify an image
• Does not work for gradual color changes
Source / Target (bilinear magnification) / Result
Edge Preserving Magnification
• The original image is the source for edge areas; otherwise the magnified image is the source
[Figure: Source 1 (magnified target), Source 2 (original), and the edge map]
Edge Preserving Magnification
Bicubic vs. Shift-Map
Easy to compose (recover) source from target & easy to compose (recover) target from source?
source ↔ target
The Bidirectional Similarity [Simakov, Caspi, Shechtman, Irani – CVPR 2008]
Completeness: all source patches (at multiple scales) should be in the target
Coherence: all target patches (at multiple scales) should be in the source
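The completeness and coherence requirements correspond to the two halves of the bidirectional (dis)similarity measure. This is our transcription of the measure in Simakov et al., CVPR 2008, where D(P, Q) is a patch distance and N_S, N_T are the numbers of patches in source S and target T:

```latex
d(S, T) =
\underbrace{\frac{1}{N_S} \sum_{P \subset S} \min_{Q \subset T} D(P, Q)}_{\text{completeness}}
\;+\;
\underbrace{\frac{1}{N_T} \sum_{Q \subset T} \min_{P \subset S} D(Q, P)}_{\text{coherence}}
```

A lower d(S, T) means the two images explain each other better in both directions.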
• It would be hard to reconstruct the Fish back
• Shift-Map retargeting maximizes the coherence
Shift-Map Retargeting with Feedback
• Increase the Appearance Data Term of input regions with a high Composition Score E<A|B> and recompute the output B
• Pixels with a higher Appearance Term will now appear in the output, increasing the completeness
Shift-Map Retargeting with Feedback
Original
Retargeted / Reconstruction of Original
Appearance Term E<A|B>
Feedback / Original Shift-Map
Shift-Map Retargeting with Feedback
Video Synopsis and Indexing: Making a Long Video Short
• 11 million cameras in 2008
• Expected 30 million in 2013
• Recording 24 hours a day, every day
Video Synopsis: Shift Objects in Time
Input Video I(x,y,t)
Synopsis Video S(x,y,t)
Steps in Video Synopsis
• Detect and track objects, store them in a database
• Select relevant objects from the database
• Display selected objects in a very short "Video Synopsis"
• In "Video Synopsis", objects from different times can appear simultaneously
• Index from selected objects into the original video
• Cluster similar objects
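The time-shifting step can be illustrated with a toy greedy scheduler: each object tube is moved to the earliest start time at which it does not collide with already-placed, spatially overlapping tubes, so activity from different original times plays simultaneously. This is a hypothetical sketch, not the energy-minimization formulation of the actual Video Synopsis work; `synopsize` and the tube representation are our own.

```python
def synopsize(tubes):
    """Greedy temporal rearrangement of object tubes (sketch).
    Each tube is (duration, bbox) with bbox = (x0, y0, x1, y1).
    Tubes whose boxes intersect spatially must not overlap in time;
    each tube is shifted to its earliest collision-free start time.
    Returns the list of assigned start times, in input order."""
    def boxes_intersect(a, b):
        return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

    placed = []   # (start, end, bbox) of already-scheduled tubes
    starts = []
    for dur, bbox in tubes:
        t = 0
        while True:
            clashes = [e for s, e, bb in placed
                       if boxes_intersect(bb, bbox) and s < t + dur and t < e]
            if not clashes:
                break
            t = min(clashes)   # jump past the earliest-ending clash and retry
        placed.append((t, t + dur, bbox))
        starts.append(t)
    return starts
```

Tubes that never overlap in space can all start at time 0, which is exactly how a long, mostly-empty surveillance video compresses into a short, dense synopsis.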
Two Clusters: Cars / People
Camera in St. Petersburg
• Detect specific events
• Discover activity patterns
ICVSS 2011 Presentations
168.176.61.22/comp/buzones/PROCEEDINGS/ICVSS2011
Jiri Matas - Tracking, Learning, Detection, Modeling
Ivan Laptev - Human Action Recognition
Josef Sivic - Large Scale Visual Search
Andrew Fitzgibbon - Computer Vision: Truth and Beauty (Kinect)
The end...
Thanks!
Angel Cruz-Roa [email protected]
Andrea Rueda-Olarte [email protected]