ConceptMap: Learning Visual Concepts from Weakly-Labeled WWW images
A work by Eren Golge
Supervised by Asst. Prof. Pinar Duygulu
Dictionary
● Visual Concept – a visual correspondence of semantic values: objects (car, bus …), attributes (red, metallic …), or scenes (indoor, kitchen, office …)
● Polysemy – multiple semantic matchings for a given word
● Model – a classifier, in the Machine Learning sense
● BoW – Bag of Words feature representation
Problems
● Hard to obtain large labeled data sets
● Query Web sources: Google, Bing, Yahoo, etc.
● Evade polysemy and irrelevancy in the gathered data
● Deal with domain adaptation
● Learn salient models
● Use lower-level concept models (objects) to discover higher-level concepts (scenes)
General Pipeline
GATHER DATA from the Web
CLUSTER and remove OUTLIERS
LEARN CLASSIFIERS
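The three stages above can be sketched as a toy data flow. Every helper here is a hypothetical one-line stand-in for the real step (web crawling, RSOM clustering, SVM training); only the shape of the pipeline is meant to be accurate.

```python
# Toy sketch of the pipeline: gather -> cluster & clean -> learn.
def gather(concept):
    # stand-in for querying a web image search engine
    return [f"{concept}_img_{i}" for i in range(5)]

def cluster_and_clean(images):
    # stand-in for RSOM: split salient members from outliers
    return {"salient": images[:4], "outliers": images[4:]}

def learn(clusters):
    # stand-in for training one SVM per salient cluster
    return {"positives": clusters["salient"]}

model = learn(cluster_and_clean(gather("car")))
```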
Hassles
● Polysemy
● Irrelevancy
● Data size
● Model learning
Method #1 : CMAP
Polysemy: Clustering
Irrelevancy: Outlier detection + Rectifying Self-Organizing Map (RSOM)
Accepted for
Draft version : http://arxiv.org/abs/1312.4384
RSOM
● A generic method, applicable to other domains as well (textual, biological, etc.)
● An extension of SOM (a.k.a. Kohonen's Map) *
● Inspired by biological phenomena **
● Able to cluster data and detect outliers
● IRRELEVANCY SOLVED!!
*Kohonen, T.: Self-organizing maps. Springer (1997)
**Hubel, D.H., Wiesel, T.N.: Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology 160(1) (1962) 106
Outlier clusters
Outlier instances in salient clusters
RSOM cont': finding outlier units
● Look at the activation statistics of each SOM unit during the learning phase
● Later learning iterations are more reliable
IF a unit is activated RARELY → OUTLIER
FREQUENTLY → SALIENT
[Figure: winner activations vs. neighbor activations]
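The unit-level statistic can be sketched as below. This is a hedged toy implementation, not the paper's code: the iteration weighting, the saliency threshold, and the omission of SOM neighborhood updates are all simplifying assumptions; only the idea (weight activations toward later iterations, flag rarely-activated units as outliers) follows the slide.

```python
import numpy as np

def rsom_unit_saliency(data, n_units=4, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(n_units, data.shape[1]))
    activation = np.zeros(n_units)
    for t in range(n_iters):
        lr = 0.5 * (1 - t / n_iters)                 # decaying learning rate
        for x in data:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            weights[winner] += lr * (x - weights[winner])
            activation[winner] += (t + 1) / n_iters  # later iterations count more
    freq = activation / activation.sum()
    return weights, freq >= 1.0 / (2 * n_units)      # rarely activated -> outlier unit

# Toy data: two dense blobs plus two stray points.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
strays = np.array([[20.0, 20.0], [-15.0, 3.0]])
_, salient = rsom_unit_saliency(np.vstack([blobs, strays]))
```

The units that lock onto the two blobs are activated on almost every pass, while the units dragged toward the strays fire only once per iteration, so they fall under the frequency threshold.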
RSOM cont': finding sole outliers
[Figure: individual outlier instances marked inside salient clusters]
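One plausible way to flag sole outlier instances inside a salient cluster is a distance test against the cluster representative. The mean + 2·std threshold below is an illustrative assumption, not necessarily RSOM's actual rule.

```python
import numpy as np

def instance_outliers(members, centroid, k=2.0):
    # flag members lying unusually far from the cluster representative
    d = np.linalg.norm(members - centroid, axis=1)
    return d > d.mean() + k * d.std()

# Toy cluster: 30 tight points plus one stray instance.
rng = np.random.default_rng(0)
cluster = np.vstack([rng.normal(0.0, 0.1, (30, 2)), [[3.0, 3.0]]])
mask = instance_outliers(cluster, cluster.mean(axis=0))
```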
Learning Models
● Learn L1-regularized linear SVM models
– Easier to train
– Better for high-dimensional data (wide data matrix)
– Implicit feature selection via the L1 norm
● Learn one linear model from each salient cluster
● Each concept has multiple models
– POLYSEMY SOLVED!!
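A minimal sketch of one such model, trained here by plain subgradient descent rather than an off-the-shelf solver (an assumption; the actual solver is not stated), illustrates the implicit feature selection: the L1 penalty keeps the weights of uninformative features near zero.

```python
import numpy as np

def train_l1_svm(X, y, lam=0.01, lr=0.05, epochs=200):
    # subgradient descent on L1-regularized hinge loss
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        violators = (X * y[:, None])[margins < 1]    # points inside the margin
        hinge_grad = -violators.sum(axis=0) / len(y)
        w -= lr * (hinge_grad + lam * np.sign(w))    # L1 subgradient -> sparse w
    return w

# Toy data: only feature 0 carries the label; features 1-2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = train_l1_svm(X, y)
```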
CMAP Overview
Retrospective
● Fergus et al. [1]
– They use a human-annotated control set to cull the data
– We use data with no human effort at all
● Berg et al. [2]
– They use the surrounding text
– We use only the visual content
● OPTIMOL, Li and Fei-Fei [3]
– They use seed images and update the model incrementally
– We use no supervision, all in one iteration
● Singh et al. [4], "Discriminative Patches"
– They require a large computer cluster and iterative data elimination
– We use a single computer, with faster and better results and no wasted iterations
● CMAP has broader possible applications
● CMAP has broader possible applications
[1] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005
[2] Berg, T.L., Berg, A.C., Edwards, J., Maire, M., White, R., Teh, Y.W., Learned-Miller, E.G., Forsyth, D.A.: Names and faces in the news. In: CVPR 2004, Volume 2, 848–854
[3] Li, L.J., Fei-Fei, L.: OPTIMOL: automatic online picture collection via incremental model learning. International Journal of Computer Vision 88(2) (2010) 147–168
[4] Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discriminative patches. In: ECCV 2012. Springer (2012) 73–86
Experiments
● Only images are used for learning
● Problems attacked:
– Attribute Learning: ImageNet [1], EBAY [2], Bing images
● Learn texture and color attributes
– Scene Learning: MIT-Indoor [4], Scene-15 [5]
● Use attributes as mid-level features
– Face Recognition: FAN-Large [6]
● Use the EASY and HARD subsets of the dataset
– Object Recognition: Google dataset [3]
[1] Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Trends and Topics in Computer Vision. Springer (2012)
[2] Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Transactions on Image Processing (2009)
[3] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005
[4] Quattoni, A., Torralba, A.: Recognizing indoor scenes. CVPR (2009)
[5] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR 2006
[6] Ozcan, M., Luo, J., Ferrari, V., Caputo, B.: A large-scale database of images and captions for automatic face naming. In: BMVC (2011)
Visual Examples
Visual Examples # Faces
Salient Clusters Outlier Clusters Outlier Instances
Implementation
● Visual Features:
– BoW SIFT with 4000 words (for texture attributes, objects, and faces)
– 3D 10x20x20 Lab histograms (for attributes)
– 256-dimensional LBP [1] (for objects and faces)
● Preprocessing
– Attribute: extract random 100x100 non-overlapping image patches from each image
– Scene: represent each image with the confidence scores of the attribute classifiers, in a Spatial Pyramid sense
– Face: apply face detection [2] to each image and keep the single highest-scoring patch
– Object: apply unsupervised saliency detection [3] to each image and keep the single highest-activation region
● Model Learning
– Use the outliers, plus a sample of the other concepts' instances, as the negative set
– Apply hard mining
– Tune all hyper-parameters (classifier and RSOM parameters) via cross-validation
● NOTICE:
– We use Google images to train the concept models, and thus deal with DOMAIN ADAPTATION
[1] Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7) (2002) 971–987
[2] Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: CVPR 2012, IEEE (2012) 2879–2886
[3] Erdem, E., Erdem, A.: Visual saliency estimation by nonlinearly integrating features using region covariances. Journal of Vision 13(4) (2013) 1–20
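The hard-mining step above can be sketched as rescoring the negative pool with the current model and keeping the highest-scoring (hardest) negatives for retraining. The pool construction and the `keep` size here are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def mine_hard_negatives(w, negatives, keep=10):
    scores = negatives @ w                 # classifier confidence per negative
    hardest = np.argsort(scores)[-keep:]   # most confidently misclassified
    return negatives[hardest]

# Toy usage: a fixed linear model and a random pool of negatives.
rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])
pool = rng.normal(size=(100, 2))
hard = mine_hard_negatives(w, pool)
```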
Results
                        Ours    State of the art
Face                    0.66    0.58 [1]
Object                  0.78    0.75 [2]
Attribute (ImageNet)    0.37    0.36 [3]
Attribute (EBAY)        0.81    0.79 [4]
Attribute (Bing)        0.82    -

- We beat all state-of-the-art methods except in scene recognition!!
However, our method is far cheaper than that of Li et al. [5].
[1] Ozcan, M., Luo, J., Ferrari, V., Caputo, B.: A large-scale database of images and captions for automatic face naming. In: BMVC (2011)
[2] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from Google's image search. In: ICCV 2005
[3] Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: Trends and Topics in Computer Vision. Springer (2012)
[4] Van De Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Transactions on Image Processing (2009)
[5] Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale internet images. CVPR (2013)
Last Words
● Fact – We propose a novel algorithm, RSOM
● Fact – It roughly beats all state-of-the-art methods
● Fact – It is a recipe for better data sets with little or no human effort
● Improvement – Estimate the number of clusters implicitly, without a hyper-parameter
● Improvement – Use a more complex classification scheme
Not much else... Thanks for your valuable time :)