ConceptMap: Learning Visual Concepts from Weakly-Labeled WWW images
A work by Eren Golge
Supervised by Asst. Prof. Pinar Duygulu
Dictionary
● Visual Concept – a visual correspondence of semantic values: objects (car, bus …), attributes (red, metallic …), or scenes (indoor, kitchen, office …)
● Polysemy – multiple semantic matchings for a given word
● Model – a classifier, in the Machine Learning sense
● BoW – Bag of Words feature representation
Problems
● Hard to obtain large labeled data sets
● Query Web sources: Google, Bing, Yahoo, etc.
● Evade polysemy and irrelevancy in the gathered data
● Deal with domain adaptation
● Learn salient models
● Use lower-level concept models (objects) to discover higher-level concepts (scenes)
General Pipeline
GATHER DATA from the Web
CLUSTER and remove OUTLIERS
LEARN CLASSIFIERS
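The three stages above can be sketched as a toy data flow. Every helper here is a hypothetical one-line stand-in for the real step (web crawling, RSOM clustering, SVM training); only the shape of the pipeline is meant to be accurate.

```python
# Toy sketch of the pipeline: gather -> cluster & clean -> learn.
def gather(concept):
    # stand-in for querying a web image search engine
    return [f"{concept}_img_{i}" for i in range(5)]

def cluster_and_clean(images):
    # stand-in for RSOM: split salient members from outliers
    return {"salient": images[:4], "outliers": images[4:]}

def learn(clusters):
    # stand-in for training one SVM per salient cluster
    return {"positives": clusters["salient"]}

model = learn(cluster_and_clean(gather("car")))
```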
Hassles
● Polysemy
● Irrelevancy
● Data size
● Model learning
Method #1 : CMAP
Polysemy: Clustering
Irrelevancy: Outlier detection + Rectifying Self-Organizing Map (RSOM)
Accepted for
Draft version : http://arxiv.org/abs/1312.4384
RSOM
● A generic method, applicable to other domains as well (textual, biological, etc.)
● An extension of SOM (a.k.a. Kohonen's Map) *
● Inspired by biological phenomena **
● Able to cluster data and detect outliers
● IRRELEVANCY SOLVED!!
*Kohonen, T.: Self-organizing maps. Springer (1997)
**Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology 160(1) (1962) 106
Outlier clusters
Outlier instances in salient clusters
RSOM cont': finding outlier units
● Look at the activation statistics of each SOM unit during the learning phase
● Later learning iterations are more reliable
IF a unit is activated RARELY → OUTLIER
FREQUENTLY → SALIENT
[Figure: winner activations vs. neighbor activations]
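The unit-level statistic can be sketched as below. This is a hedged toy implementation, not the paper's code: the iteration weighting, the saliency threshold, and the omission of SOM neighborhood updates are all simplifying assumptions; only the idea (weight activations toward later iterations, flag rarely-activated units as outliers) follows the slide.

```python
import numpy as np

def rsom_unit_saliency(data, n_units=4, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(n_units, data.shape[1]))
    activation = np.zeros(n_units)
    for t in range(n_iters):
        lr = 0.5 * (1 - t / n_iters)                 # decaying learning rate
        for x in data:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            weights[winner] += lr * (x - weights[winner])
            activation[winner] += (t + 1) / n_iters  # later iterations count more
    freq = activation / activation.sum()
    return weights, freq >= 1.0 / (2 * n_units)      # rarely activated -> outlier unit

# Toy data: two dense blobs plus two stray points.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
strays = np.array([[20.0, 20.0], [-15.0, 3.0]])
_, salient = rsom_unit_saliency(np.vstack([blobs, strays]))
```

The units that lock onto the two blobs are activated on almost every pass, while the units dragged toward the strays fire only once per iteration, so they fall under the frequency threshold.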
RSOM cont': finding sole outliers
[Figure: individual outlier instances marked inside salient clusters]
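One plausible way to flag sole outlier instances inside a salient cluster is a distance test against the cluster representative. The mean + 2·std threshold below is an illustrative assumption, not necessarily RSOM's actual rule.

```python
import numpy as np

def instance_outliers(members, centroid, k=2.0):
    # flag members lying unusually far from the cluster representative
    d = np.linalg.norm(members - centroid, axis=1)
    return d > d.mean() + k * d.std()

# Toy cluster: 30 tight points plus one stray instance.
rng = np.random.default_rng(0)
cluster = np.vstack([rng.normal(0.0, 0.1, (30, 2)), [[3.0, 3.0]]])
mask = instance_outliers(cluster, cluster.mean(axis=0))
```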
Learning Models
● Learn L1-regularized linear SVM models
– Easier to train
– Better for high-dimensional data (wide data matrix)
– Implicit feature selection via the L1 norm
● Learn one linear model from each salient cluster
● Each concept has multiple models
– POLYSEMY SOLVED!!
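A minimal sketch of one such model, trained here by plain subgradient descent rather than an off-the-shelf solver (an assumption; the actual solver is not stated), illustrates the implicit feature selection: the L1 penalty keeps the weights of uninformative features near zero.

```python
import numpy as np

def train_l1_svm(X, y, lam=0.01, lr=0.05, epochs=200):
    # subgradient descent on L1-regularized hinge loss
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        violators = (X * y[:, None])[margins < 1]    # points inside the margin
        hinge_grad = -violators.sum(axis=0) / len(y)
        w -= lr * (hinge_grad + lam * np.sign(w))    # L1 subgradient -> sparse w
    return w

# Toy data: only feature 0 carries the label; features 1-2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = train_l1_svm(X, y)
```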
CMAP Overview
Retrospective
● Fergus et al. [1]
– They use a human-annotated control set to cull the data
– We use data with no human effort at all
● Berg et al. [2]
– They use the surrounding text
– We use only the visual content
● OPTIMOL, Li and Fei-Fei [3]
– They use seed images and update the model incrementally
– We use no supervision, all in one iteration
● Singh et al. [4], "Discriminative Patches"
– They require a large computer cluster and iterative data elimination
– We use a single computer, with faster and better results and no wasted iterations
● CMAP has broader possible applications
● CMAP has broader possible applications
[1] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005
[2] Berg, T.L., Berg, A.C., Edwards, J., Maire, M., White, R., Teh, Y.W., Learned-Miller, E.G., Forsyth, D.A.: Names and faces in the news. In: CVPR 2004, Volume 2, 848–854
[3] Li, L.J., Fei-Fei, L.: OPTIMOL: automatic online picture collection via incremental model learning. International Journal of Computer Vision 88(2) (2010) 147–168
[4] Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discriminative patches. In: ECCV 2012. Springer (2012) 73–86
Experiments
● Only images are used for learning
● Problems attacked:
– Attribute Learning: ImageNet [1], EBAY [2], Bing images
● Learn texture and color attributes
– Scene Learning: MIT-Indoor [4], Scene-15 [5]
● Use attributes as mid-level features
– Face Recognition: FAN-Large [6]
● Use the EASY and HARD subsets of the dataset
– Object Recognition: Google dataset [3]
[1] Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Trends and Topics in Computer Vision. Springer (2012)
[2] Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Transactions on Image Processing (2009)
[3] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005
[4] Quattoni, A., Torralba, A.: Recognizing indoor scenes. CVPR (2009)
[5] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR 2006
[6] Ozcan, M., Luo, J., Ferrari, V., Caputo, B.: A large-scale database of images and captions for automatic face naming. In: BMVC (2011)
Visual Examples
Visual Examples # Faces
Salient Clusters Outlier Clusters Outlier Instances
Implementation
● Visual Features:
– BoW SIFT with 4000 words (for texture attributes, objects, and faces)
– 3D 10x20x20 Lab histograms (for attributes)
– 256-dimensional LBP [1] (for objects and faces)
● Preprocessing
– Attribute: extract random 100x100 non-overlapping image patches from each image
– Scene: represent each image with the confidence scores of the attribute classifiers, in a Spatial Pyramid sense
– Face: apply face detection [2] to each image and keep the single highest-scoring patch
– Object: apply unsupervised saliency detection [3] to each image and keep the single highest-activation region
● Model Learning
– Use the outliers, plus a sample of the other concepts' instances, as the negative set
– Apply hard mining
– Tune all hyper-parameters (classifier and RSOM parameters) via cross-validation
● NOTICE:
– We use Google images to train the concept models, and thus deal with DOMAIN ADAPTATION
[1] Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7) (2002) 971–987
[2] Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: CVPR 2012, IEEE (2012) 2879–2886
[3] Erdem, E., Erdem, A.: Visual saliency estimation by nonlinearly integrating features using region covariances. Journal of Vision 13(4) (2013) 1–20
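The hard-mining step above can be sketched as rescoring the negative pool with the current model and keeping the highest-scoring (hardest) negatives for retraining. The pool construction and the `keep` size here are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def mine_hard_negatives(w, negatives, keep=10):
    scores = negatives @ w                 # classifier confidence per negative
    hardest = np.argsort(scores)[-keep:]   # most confidently misclassified
    return negatives[hardest]

# Toy usage: a fixed linear model and a random pool of negatives.
rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])
pool = rng.normal(size=(100, 2))
hard = mine_hard_negatives(w, pool)
```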
Results
                        Ours    State of the art
Face                    0.66    0.58 [1]
Object                  0.78    0.75 [2]
Attribute (ImageNet)    0.37    0.36 [3]
Attribute (EBAY)        0.81    0.79 [4]
Attribute (Bing)        0.82    -

- We beat all state-of-the-art methods except in scene recognition!!
However, our method is far cheaper than that of Li et al. [5].
[1] Ozcan, M., Luo, J., Ferrari, V., Caputo, B.: A large-scale database of images and captions for automatic face naming. In: BMVC (2011)
[2] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005
[3] Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Trends and Topics in Computer Vision. Springer (2012)
[4] Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Transactions on Image Processing (2009)
[5] Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale internet images. CVPR (2013)
Last Words
● Fact – We propose a novel algorithm, RSOM
● Fact – It roughly beats all state-of-the-art methods
● Fact – It is a recipe for better data sets with little or no human effort
● Improvement – Estimate the number of clusters implicitly, without a hyper-parameter
● Improvement – Use a more complex classification scheme
Not much else... Thanks for your valuable time :)