
Scene recognition by semantic visual words

Elahe Farahzadeh · Tat-Jen Cham · Andrzej Sluzek

Received: 20 January 2014 / Revised: 17 May 2014 / Accepted: 5 August 2014
© Springer-Verlag London 2014

Abstract In this paper, we propose a novel approach to introduce semantic relations into the bag-of-words framework. We use latent semantic models, namely latent semantic analysis (LSA) and probabilistic latent semantic analysis (pLSA), to define semantically rich features and embed the visual features into a semantic space. The semantic features used in the LSA technique are derived from a low-rank approximation of the word–image occurrence matrix by singular value decomposition. Similarly, with the pLSA approach, the topic-specific distributions of words can be considered dimensions of a concept space. In the proposed space, the distances between words represent semantic distances, which are used for constructing a discriminative and semantically meaningful vocabulary. Position information significantly improves scene recognition accuracy; inspired by this, we also bring position information into the proposed semantic vocabulary frameworks. We have tested our approach on the 15-Scene and 67-MIT Indoor datasets and achieved very promising results.

E. Farahzadeh (B)
Center of Computational Intelligence, School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore
e-mail: [email protected]

T.-J. Cham
Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore
e-mail: [email protected]

A. Sluzek
Department of Electrical and Computer Engineering, Khalifa University of Science, Technology and Research, 127788 Abu Dhabi, UAE
e-mail: [email protected]

Keywords Scene recognition · Semantic vocabulary · Visual words

1 Introduction

The bag-of-words (BOW) framework has been shown to be useful in various computer vision applications such as object recognition [3] and scene recognition [6]. The framework builds a visual vocabulary by vector quantization of raw features extracted from local image patches. The vector quantization essentially involves clustering the raw features by k-means and choosing each cluster's mean as a codeword, or visual word. However, an important drawback of k-means clustering is that it is based on the appearance of the image or video as represented in the raw features, rather than on the semantic relations between features. Utilizing the semantics inherent in visual content improves image/video categorization and understanding.

There have been several attempts to incorporate semantics into the BOW model so that a more discriminative visual vocabulary is realized. Generative methods use latent variable models such as probabilistic latent semantic analysis (pLSA) [21,30] and latent Dirichlet allocation (LDA) [6,21] to obtain a model for each category and subsequently fit the query to one of the models in an unsupervised manner. Although these methods are efficient, their unsupervised nature limits their performance. Moreover, the number of topics in these methods is equal to the number of categories, which further limits their effectiveness. Discriminative methods which incorporate label information have also been explored. Among the recent methods is the notable work of Liu and Shah, which finds a semantic visual vocabulary via maximization of mutual information (MMI) between visual words and images [16] or videos [17].


The algorithm starts with singleton clusters and, in each iteration, merges the two clusters whose merging results in the minimum loss of mutual information. This procedure continues until a certain threshold in the information loss or in the number of clusters is reached. This approach is effective in discovering the optimum number of clusters, but the formed clusters do not necessarily represent topics or synonymous words, which is required for constructing discriminative histograms.

Liu et al. [18] use a diffusion map (DM) to construct a semantic visual vocabulary. Unlike geodesic distance, which is based on the shortest path between points, diffusion distance considers all paths between two points, and hence is not sensitive to noise. However, relying on connectivity to measure semantic distance is not appropriate in the presence of polysemy.1 For example, assume that word B is a polyseme with two distinct meanings, 1 and 2. If word B is connected to word A based on meaning 1 (they share meaning 1) and word B is also connected to word C based on meaning 2, then words A and C will be connected in the diffusion distance framework even though they convey different meanings. So, diffusion distance does not always represent semantic distance.

Considering these drawbacks, we propose a method for scene recognition based on a semantic visual vocabulary that uses latent aspect models to embed visual words into a rich semantic space which we call the concept space. Using latent semantic analysis (LSA) or its probabilistic version, pLSA, synonymous words which convey the same meaning are embedded close to each other so that they can be clustered together into the same semantic cluster. The distance in the proposed concept space is based on the meanings of the words and thus represents their semantic relations. Consequently, the histograms formed from these semantic clusters are efficient and discriminative for scene recognition.

In contrast with generative methods that do not make use of category labels, our method trains a classifier using the histograms from the training set. Moreover, in our method, the number of topics can be varied, as opposed to the unsupervised framework where this number is fixed and equal to the number of classes. This allows us to analyze the semantic relations in more detail and to consider as many topics as appropriate. In addition, pLSA is able to handle polysemy, which is very effective when different categories share the same topics; e.g., the office, living room and kitchen categories include similar visual tokens such as chairs and tables.

1 Polysemy is the existence of words which convey different concepts in different images. For instance, in the text domain, the word table can be interpreted either as a piece of furniture or as an arrangement of data.

2 Related works

In this section, we review some methods that attempt to introduce semantic relations in the BOW framework for object and scene recognition. These approaches can be broadly divided into generative and discriminative methods. Generative methods usually involve hidden variables; these methods [6,30] try to model each image as a mixture of hidden concepts using either pLSA or LDA. On the other hand, discriminative methods are based only on observed variables and usually incorporate a classifier. Among these methods, Vogel and Schiele [31] define a set of concept classes (visual words) like sand, sky and sea to label image regions. In this method, image regions are represented by a combination of color and texture features and classified into concept classes. Thus, for each image, a concept occurrence vector is constructed and classified for scene retrieval. For labeling databases containing ambiguous images, this approach claims that obtaining the ground truth for local semantic concepts is easier than for the whole image. However, the approach suffers from the large amount of manual work needed to annotate local regions. Randomized clustering forests have been used by Moosmann et al. [20] for image classification. Ensembles of decision trees are constructed based on the image class labels, and visual words are assigned to each leaf. After building the trees, a bottom-up pruning process is applied until a threshold on the number of leaves is reached, which controls the codebook size; the forests are used for clustering and quantization. Randomized forests have also been used in [10] for object classification and segmentation. In that work, decision trees are applied directly to image pixels to save the time of extracting descriptors. In contrast to [20], which uses forests for clustering only, they use forests for both clustering and classification purposes. Randomized clustering forests are fast yet discriminative compared to conventional k-means clustering; however, they tend to overfit, especially in noisy situations. Moreover, the model is complex, and it is hard to understand the relations between the predictor variables.

Quelhas et al. [27] have used the pLSA model to extract image-specific distributions of topics in order to represent images, followed by an SVM classifier to classify scenes. Bosch et al. [1] have used a similar framework. Liu et al. [18] use diffusion distance to build a semantic dictionary. They construct a graph on the visual words in which the edge weights reflect similarity, with visual words represented by pointwise mutual information. By applying the diffusion map, points are embedded into a lower-dimensional space in which Euclidean distance equals diffusion distance.

The approach proposed in this paper is a discriminative embedding method that projects visual words into a concept space where the dimensions are discovered in an unsupervised manner by latent topic models.


Fig. 1 Constructing the semantic visual vocabulary


3 Overview of the proposed framework

In the proposed framework, after projecting images into histograms of visual words, these histograms are embedded into a lower-dimensional space using LSA/pLSA. Our main contribution is co-clustering the visual words that are semantically close together in this new space to build a semantic vocabulary. Afterward, we propose two approaches to build spatial pyramids over the semantic visual words. Figure 1 shows the flowchart for constructing the semantic visual vocabulary via embedding into the concept space. We first extract features from patches in the images. The initial vocabulary is constructed by performing k-means clustering on the extracted features and choosing the cluster centers as the codewords. The feature vectors are quantized based on the initial codebook to form the word–image matrix, which describes the occurrences of words in images. The codewords are then embedded into the concept space by latent semantic models; we demonstrate the embedding both with LSA and with pLSA. Finally, the embedded codewords in the concept space are again clustered using k-means to obtain the desired semantic visual vocabulary. A minimal sketch of this pipeline is given below.
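The following Python sketch illustrates the pipeline under stated assumptions; it is not the authors' code, and the function names and parameter defaults (K_i, K_f) are illustrative. The embedding step in between is filled in by the LSA/pLSA sketches of Sect. 4.

```python
# Hypothetical sketch of the Fig. 1 pipeline; parameter defaults are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def initial_vocabulary(all_descriptors, Ki=1500):
    """k-means on raw patch descriptors; cluster centers are the initial codewords."""
    return KMeans(n_clusters=Ki, n_init=4, random_state=0).fit(all_descriptors)

def word_image_matrix(km, descriptors_per_image):
    """Occurrence matrix X (words x images): X[w, j] = count of word w in image j."""
    X = np.zeros((km.n_clusters, len(descriptors_per_image)))
    for j, D in enumerate(descriptors_per_image):
        for w in km.predict(D):
            X[w, j] += 1
    return X

def semantic_vocabulary(word_embeddings, Kf=600):
    """Re-cluster the embedded codewords; each cluster is one semantic visual word."""
    labels = KMeans(n_clusters=Kf, n_init=4, random_state=0).fit_predict(word_embeddings)
    return labels  # labels[w] = semantic word of initial codeword w
```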

The contributions of this paper are threefold:

Using word space. All of the methods that project images into a latent semantic space by pLSA work in the new semantic document space; in order to classify a test image, the image is projected into that document space. In contrast, the focus of our framework is the semantic word space. Similar visual words in the word space are co-clustered to form a semantic visual vocabulary, and a test image is directly quantized against this vocabulary without embedding the image itself.

Investigating the changes in the number of topics. In generative frameworks employing pLSA, the number of topics is set equal to the number of categories. In contrast, our method analyzes changes in the number of topics and empirically fixes the best value for representing the semantic relations in the scene.

Using LSA embedding. Our framework applies both LSA and pLSA semantic embedding. To the best of our knowledge, no other method uses LSA as a semantic embedding, although, according to our experiments, the results of LSA and pLSA are almost on par. However, the time and memory complexity of LSA is substantially lower, since it uses a simple singular value decomposition (SVD), whereas pLSA uses an expensive expectation–maximization algorithm. This makes LSA really advantageous.

4 Concept space

We obtain the initial vocabulary by performing k-means clustering on the extracted visual features. This initial codebook forms a reasonably sized set representing all features, but the codewords are not semantically clustered, i.e., the features in a cluster may convey different concepts. Thus, the formed histograms will not be semantically discriminative for classification. Therefore, we need a space in which semantically related words are adjacent. In order to find such a space, we use latent semantic models that find the underlying latent semantics given the word–image occurrence matrix. These models are the well-known LSA and pLSA, which we briefly review in the following sections. We use tf-idf weights instead of raw counts in the occurrence matrix for better performance, as sketched below.
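As a sketch, one standard tf-idf weighting of the word–image matrix looks as follows; the paper does not specify which variant it uses, so this particular formula is an assumption.

```python
import numpy as np

def tfidf(X):
    """Weight a (words x images) count matrix X; one common tf-idf variant."""
    tf = X / np.maximum(X.sum(axis=0, keepdims=True), 1)  # word frequency per image
    df = np.count_nonzero(X, axis=1)                      # images containing each word
    idf = np.log(X.shape[1] / np.maximum(df, 1))          # inverse document frequency
    return tf * idf[:, None]
```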

4.1 Embedding into concept space using latent semantic analysis

LSA [4] finds a low-rank approximation of the word–image matrix. The word–image matrix itself carries semantics, since synonymous words appear in similar images, resulting in similarities among their occurrence vectors. However, the original word–image matrix is noisy, sparse and large. Hence, a low-rank approximation of the original matrix is desirable. A consequence of this dimension reduction is that the dimensions relating to synonymous words (e.g., see and look in the text domain) are merged. In other words, the LSA hypothesis is that synonymous terms will have the same direction in the latent semantic space.

Let X be the occurrence matrix whose rows correspond to words and whose columns correspond to images. Decomposing X using SVD, i.e.,

X = UΣV^T, (1)

gives the orthogonal matrices U and V and the diagonal matrix Σ that contains the singular values of X. By selecting the L largest singular values and their corresponding singular vectors, we find the rank-L approximation of X:

X ≈ U_L Σ_L V_L^T. (2)

This rank-L approximation is optimal in the sense of the l2 matrix norm. The column vectors of U_L span the concept space of words and the columns of V_L span the concept space of images, so we can regard the M × L matrix U_L as the word space and the L × N matrix V_L^T as the image space. The SVD decomposition process is illustrated in Fig. 2.


Fig. 2 SVD decomposition, word space and concept space

Fig. 3 Graphical view of pLSA model

The ith row of X, t_i, describes the ith word. Consequently, the ith row of U_L is the description of the ith word in the concept space with L concepts, and we refer to it as t̂_i. In fact, each of the L dimensions of the low-dimensional vector t̂_i is the projection of the word onto one of the concepts. It is expected that synonymous words are close in the concept space.
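A sketch of this embedding (Eqs. 1–2), assuming the plain variant in which t̂_i is simply the ith row of U_L; some LSA variants additionally scale the rows by Σ_L:

```python
import numpy as np

def lsa_word_embedding(X, L=70):
    """Rows of the returned M x L matrix are the concept-space vectors t_hat_i."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U Sigma V^T (Eq. 1)
    return U[:, :L]                                   # keep the L largest concepts (Eq. 2)
```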

4.2 Embedding into concept space using probabilistic latent semantic analysis

pLSA [9] is the statistical version of LSA, which defines a generative model on the data. It is assumed that there is a latent topic variable z_l associated with the occurrence of word w_i in image d_j. The joint probability P(w_i, d_j, z_l) is assumed to follow the graphical model in Fig. 3. The observed variables are w_i and d_j, while z_l is latent. The probability of the observation pair P(w_i, d_j) is:

P(w_i, d_j) = P(w_i | d_j) P(d_j). (3)

Since w_i and d_j are assumed to be conditionally independent given the latent topic, we can marginalize over the topics z_l to find the conditional probability P(w_i | d_j), i.e.,

P(w_i | d_j) = Σ_{l=1}^{L} P(w_i | z_l) P(z_l | d_j), (4)

where P(z_l | d_j) is the probability of occurrence of topic z_l in document d_j, P(w_i | z_l) is the probability of occurrence of word w_i given topic z_l, and L is the total number of latent topics. Equation 4 is a decomposition of the word–document matrix, similar to LSA, but with the condition that the values are normalized to be probability distributions. We fit the model by determining P(z_l | d_j) and P(w_i | z_l) given the observed occurrence matrix. Maximum likelihood estimation of the parameters is performed using the expectation–maximization (EM) algorithm.

Assuming a vocabulary of M words and N documents, the likelihood function to be maximized is:

∏_{i=1}^{M} ∏_{j=1}^{N} P(w_i | d_j)^{n(w_i, d_j)}, (5)

where n(w_i, d_j) is the number of occurrences of word w_i in document d_j and P(w_i | d_j) is obtained from Eq. 4.

The original pLSA algorithm in the unsupervised learning framework tries to categorize the query document given the learned parameters [9]. However, we use the pLSA algorithm only to determine the probabilities P(w_i | z_l). In fact, P(w_i | z_l) is equivalent to the lth dimension of t̂_i in the LSA framework. Therefore, using pLSA we obtain the concept-space embedded vector t̂_i as:

t̂_i = [P(w_i | z_1) P(w_i | z_2) · · · P(w_i | z_L)]^T. (6)

It should be noted here that L, the dimension of the concept space, does not need to be equal to the number of classes; this enables us to define an arbitrary number of latent concepts. In fact, the number of semantic topics can be much larger than the number of classes. In other words, classes are wider concepts that may include finer and more detailed concepts, which are referred to as topics. For instance, "computer" and "pen" are topics related to the office class. Note that in the unsupervised framework in which pLSA is used (e.g., in [30]), the dimension of the concept space must be equal to the number of classes.

In LSA, each word is projected onto a single point in the concept space, so each word can refer to a single meaning only. In contrast, pLSA is able to capture polysemy. Given a word w observed in two different documents d_i and d_j, the topics associated with the word in d_i and d_j can be different; in other words, arg max p(z | d_i, w) can differ from arg max p(z | d_j, w) [9,15]. The advantage of LSA over pLSA is its faster and easier implementation: LSA needs a simple SVD, while pLSA uses the iterative EM algorithm, which is only guaranteed to find a local maximum of the likelihood function [9]. A sketch of such an EM fit is given below.
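For illustration, a bare-bones EM fit of pLSA (Eqs. 4–5), used here only to obtain P(w_i | z_l) as in Eq. 6. This is a sketch under the stated model, not the authors' implementation; it materializes an M × L × N posterior array, so it only suits small vocabularies.

```python
import numpy as np

def plsa_word_embedding(X, L=70, iters=100, seed=0):
    """Return the M x L matrix whose row i is t_hat_i = [P(w_i|z_1) ... P(w_i|z_L)]."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    p_w_z = rng.random((M, L)); p_w_z /= p_w_z.sum(axis=0)  # P(w|z), columns sum to 1
    p_z_d = rng.random((L, N)); p_z_d /= p_z_d.sum(axis=0)  # P(z|d), columns sum to 1
    for _ in range(iters):
        # E-step: P(z|w,d) proportional to P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]               # M x L x N
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        # M-step: expected counts n(w,d) P(z|w,d), then renormalize
        nz = X[:, None, :] * post                                   # M x L x N
        p_w_z = nz.sum(axis=2)
        p_w_z /= np.maximum(p_w_z.sum(axis=0), 1e-12)
        p_z_d = nz.sum(axis=0)
        p_z_d /= np.maximum(p_z_d.sum(axis=0), 1e-12)
    return p_w_z
```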

5 Experimental results

To demonstrate the efficiency of our method, we have evaluated it on two challenging scene datasets: the 15-Scene dataset [6,12] and the MIT 67-Indoor Scenes dataset [26].

The features applied in these experiments use 16 × 16 patches sampled densely with M = 8 pixel spacing; the feature descriptor applied to each patch is SIFT [19] (a sampling sketch is given below). For the 15-Scene dataset, we randomly select 100 images per category as training images and the rest for testing. The results are averaged over five random splits of the training and testing images.
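A sketch of such dense sampling using OpenCV; the authors' exact extractor settings beyond the patch size and spacing are not specified, so this is an assumption.

```python
import cv2
import numpy as np

def dense_sift(gray, patch=16, step=8):
    """Compute SIFT on a dense grid of patch-sized keypoints (OpenCV >= 4.4)."""
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(patch))
           for y in range(patch // 2, h - patch // 2 + 1, step)
           for x in range(patch // 2, w - patch // 2 + 1, step)]
    kps, desc = cv2.SIFT_create().compute(gray, kps)
    locs = np.array([kp.pt for kp in kps])  # (x, y) locations, used later for SPM
    return locs, desc
```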


Fig. 4 Performance of proposed method on 15-Scene dataset with different number of topics using LSA (a)/pLSA (b)

For the MIT 67-Indoor Scenes dataset, we used exactly the same partitions as in [26], which contain 80 images for training and 20 images for testing, so that all results are directly comparable.

A support vector machine (SVM) with the histogram intersection kernel (HIK) is used as the classifier, and the size of the initial vocabulary (K_i) is fixed to 1,500 throughout the experiments. A sketch of this setup is given below.
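A sketch of this classifier setup; scikit-learn has no built-in HIK, so the kernel matrix is precomputed, and the helper names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def hik(A, B):
    """Histogram intersection kernel between row-wise histogram matrices A and B."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def train_hik_svm(H_train, y_train):
    clf = SVC(kernel='precomputed')
    clf.fit(hik(H_train, H_train), y_train)
    return clf

# Prediction uses the test-vs-train kernel: clf.predict(hik(H_test, H_train))
```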

5.1 Results and analysis

One of the advantages of the proposed method is that it allows the number of topics to be varied, in contrast to pLSA in the unsupervised framework (where the number of topics is constrained to be the same as the number of classes). Figure 4a, b shows how the number of topics L affects the recognition accuracy with LSA and pLSA as the embedding method. The experiments have been performed using three different semantic vocabulary sizes, K_f. As the number of topics is increased above L = 15 (the number of classes), the recognition rate increases, since the larger number of topics enables better discrimination between classes. However, after around L = 70 topics, the recognition accuracy decreases. This is mainly because adding more dimensions to the concept space implies further division into semantic units that are not meaningful. The recognition accuracy varies by about 1–3 % as L varies.

According to the experiments, the number of topics giving the best accuracy is the same for all three values of K_f. Thus, the number of topics is independent of the final vocabulary size.

To verify the efficiency and discriminative ability of the method in scene classification, we have compared it with the classic BOW framework for different vocabulary sizes. The results are shown in Fig. 5, with the number of topics set to 70. According to the figure, our method outperforms the classic BOW framework in all cases, which shows the efficiency of the proposed method. Apart from the instabilities in the initial parts of the curves, the behaviors of LSA and pLSA are consistent. For small vocabulary sizes, pLSA outperforms LSA by a small margin due to its ability to handle polysemy. However, as the vocabulary size increases, LSA performs better than pLSA (crossing at approximately K_f = 900).

Fig. 5 Comparison of results with the classic framework for different sizes of vocabulary in 15-Scene dataset

Fig. 6 Performance of proposed method on 67-MIT Indoor dataset using LSA concept space by changing the number of topics

Based on the differences between pLSA and LSA embedding, the likely cause of this effect is that the larger vocabulary size brings in more details and compensates for the effect of polysemy. However, pLSA takes into account every possible meaning of a word, even the rare ones, which causes confusion in larger vocabularies and thus reduces the accuracy. It should also be noted that LSA always has a shorter running time than pLSA, due to the time-consuming iterative EM process for pLSA compared to the straightforward SVD in LSA.

The best result achieved on the 15-Scene dataset with our method is 79.22 %, using the pLSA model and a semantic codebook size of 600.

Figure 6 shows how the number of topics L affects the recognition accuracy with LSA as the embedding method on the MIT 67-Indoor Scenes dataset. According to the experiments, the number of topics giving the best accuracy is the same for all three values of K_f.


Fig. 7 Comparison of results with the classic framework for different sizes of vocabulary in 67-MIT Indoor dataset

Table 1 Semantic visual words versus original visual words

The first column shows the scene images, the second column illustrates the result of applying the original visual vocabulary to the scene image, and the third column illustrates the result of quantizing dense SIFT visual features based on the semantic visual words. Each visual word is illustrated with one color; fewer colors in the marked areas shows that visual words in the semantic areas are clustered together

As the number of topics is increased above L = 67, the recognition rate increases, since the larger number of topics enables better discrimination between classes. However, after around L = 150 topics, the recognition accuracy decreases. As mentioned before, this is because adding more dimensions to the concept space implies further division into semantic units that are not meaningful.

Figure 7 shows the efficiency of our framework using either pLSA or LSA embedding in comparison with the classic bag-of-words.

To confirm the advantage of the semantic visual vocabulary over the original visual vocabulary, in Table 1 we present the results of applying both vocabularies to four different scene images. For clearer illustration, we have only shown a subset (40) of the visual words. Each color is associated with one visual word. As seen in the table's third column, which is associated with the semantic visual vocabulary, the areas are more uniform, especially in the regions marked by rectangles in the scene image.

6 Capturing image spatial information in scene recognition systems

In the methods proposed in Sect. 3, spatial information was not considered. In this section, we make use of spatial information to increase the efficiency. Spatial arrangements of visual features have a significant impact on image classification systems. Lazebnik et al. [12] developed a spatial version of the pyramid match kernel (PMK) [7] to overcome the lack of spatial information in the bag-of-words framework. The pyramid match kernel was proposed by Grauman and Darrell [7] on the feature space to find the correspondence between feature sets (such as bag-of-words feature vectors) using a hierarchical quantization. Lazebnik et al. [12] have used spatial pyramid matching (SPM) on the image space, i.e., instead of dividing the feature space into hierarchical levels, the pyramid levels are built over image sub-regions. SPM successfully accomplishes the goal of incorporating spatial information into the bag-of-words framework, and several frameworks that incorporate a spatial pyramid structure have reported better classification accuracy [2,11,32].

In this section, we impose spatial information on the framework proposed in Sect. 3. We propose two methods (global and region-based) to capture location information when using a semantic vocabulary. Both methods build spatial pyramids over the image's blocks. The methods differ in that the global method first forms the semantic vocabulary and then divides the image into spatial sub-regions, while the region-based method forms the semantic vocabulary over each region individually.

6.1 Capturing spatial information with spatial pyramid matching (SPM)

The spatial pyramid matching technique is a simple yet efficient framework. SPM gives a multi-resolution representation of the image by dividing it into increasingly finer sub-regions. It was first proposed by Lazebnik et al. [12], inspired by the pyramid match kernel (PMK), which defines different levels of resolution in the feature space using increasingly coarser grids [7]. Although PMK is very precise, it ignores the spatial location of individual features. SPM has been successfully used as an extension of the bag-of-words framework: whereas the bag-of-words framework gives an orderless representation of the image, SPM uses a spatial pyramid representation. The matching score of this pyramid representation is obtained by a weighted combination of histogram intersections at multiple spatial resolutions.

The image local feature f is denoted by f = (x, y, d), where (x, y) are the feature's location coordinates and d is the feature descriptor. In the BoW framework, there are K discrete cluster centroids.


Fig. 8 Example of constructing a three-level pyramid

Each centroid represents one of the visual words. Feature f is quantized into one of these centroids with respect to its descriptor, but the location coordinates (x, y) are completely ignored.

In the SPM framework, on the other hand, these location coordinates are used to enhance the power of BoW. As shown in Fig. 8, at resolution level l the image is divided into 2^l × 2^l uniform sub-regions, with 2^l evenly sized partitions in each dimension (horizontal and vertical); at higher levels of resolution, the image is divided into finer regions. At resolution l (0 ≤ l ≤ L), feature f is assigned to one of the 2^l × 2^l sub-regions based on its location coordinates, while its descriptor is quantized into one of the K centroids, as in the sketch below.
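The assignment of a feature to a sub-region at level l amounts to simple integer arithmetic, e.g. (a sketch):

```python
def subregion_index(x, y, width, height, l):
    """Index of the 2^l x 2^l block containing location (x, y), in [0, 4^l)."""
    cells = 2 ** l
    cx = min(int(x * cells / width), cells - 1)   # column of the block
    cy = min(int(y * cells / height), cells - 1)  # row of the block
    return cy * cells + cx
```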

Given X_k and Y_k, the sets of two-dimensional coordinates of the features in channel k, let H^l_{X_k} and H^l_{Y_k} be their histograms at resolution level l, each of dimension D. To find the number of matches at level l for channel k, the HIK is applied to these histograms:

I(H^l_{X_k}, H^l_{Y_k}) = Σ_{i=1}^{D} min(H^l_{X_k}(i), H^l_{Y_k}(i)). (7)

The matching score in channel k is measured by applying the PMK:

K^L(X_k, Y_k) = (1/2^L) I^0 + Σ_{l=1}^{L} (1/2^{L−l+1}) I^l. (8)

The pyramid matching formulation K^L(X_k, Y_k) shows that the weights are inversely proportional to the size of the sub-regions. The final match kernel is the sum of the match scores over all K channels:

K^L(X, Y) = Σ_{k=1}^{K} K^L(X_k, Y_k). (9)
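Eqs. 7–9 translate directly into code. The following sketch assumes H_X[k][l] and H_Y[k][l] are the flattened level-l spatial histograms of channel k; the weighting follows Eq. 8 as reconstructed above.

```python
import numpy as np

def pyramid_match(H_X, H_Y, L):
    """Pyramid match score of Eq. 9, summing Eq. 8 over all channels."""
    score = 0.0
    for k in range(len(H_X)):
        # Eq. 7: histogram intersection matches at each level
        I = [np.minimum(H_X[k][l], H_Y[k][l]).sum() for l in range(L + 1)]
        # Eq. 8: level weights inversely proportional to sub-region size
        score += I[0] / 2 ** L + sum(I[l] / 2 ** (L - l + 1) for l in range(1, L + 1))
    return score
```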

6.2 Spatial content capture using global and region-based models

In this section, the SPM scheme is incorporated into the new concept space proposed in Sect. 3.

To perform the spatial content capture, we suggest two methods: global and region-based. In the global method, after projecting the word space into the concept space, the k-means algorithm is applied to co-cluster the synonymous words. The image is divided into increasingly finer blocks. Afterward, the visual features within each sub-region are quantized into bag-of-words histograms based on the appearance-based vocabulary, and the histogram bins of the synonymous words are then merged together.

The final spatial representation based on the semantic vocabulary is obtained after applying the spatial pyramid weighting for each level of resolution; the core merging step is sketched below.
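The merging step of the global method reduces to re-binning each sub-region histogram with the semantic cluster labels from Sect. 3, e.g. (a sketch; `semantic_labels` is assumed to be the output of the re-clustering step):

```python
import numpy as np

def semantic_region_histogram(word_ids, semantic_labels, Kf):
    """word_ids: initial-codeword indices of the features in one sub-region;
    semantic_labels[w]: semantic cluster of initial codeword w."""
    h = np.zeros(Kf)
    for w in word_ids:
        h[semantic_labels[w]] += 1  # bins of synonymous words merge into one cell
    return h
```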

In the region-based method, there is a separate concept space for every spatial block of the image, i.e., the latent semantic models are applied to each sub-region after it is quantized into a bag-of-words histogram according to the appearance-based vocabulary. K-means clustering is applied to these new regional word–topic (concept) spaces, and consequently, the histogram bins for the synonymous words in the bag-of-words representation are merged together.

We argue that the region-based method is more effective than the global method because the spatial locations are considered while co-clustering the synonymous words; this is important because semantic visual words differ according to the visual features located in that partition of the image.

Although the experiments show that the region-based method outperforms the global method, the improvement is not very significant, which is due to the nature of visual word vocabularies: since there is only one semantic vocabulary for the whole dataset, the location of the features matters less while building the vocabulary than which part of the image is projected with it.

6.3 Experimental evaluation on 15-Scene dataset

In these experiments, the number of topics is fixed to 70, which, according to [28], results in the highest accuracy. In the first experimental series, we apply the global method to impose spatial information. As shown in Fig. 9a, b, the results are very promising and demonstrate the efficiency of our method. In these figures, we study the influence of three different initial vocabulary sizes (K_i) on the recognition accuracy while the final semantic vocabulary size (K_f) changes. In Fig. 9a, LSA is used as the embedding method; the best accuracy is achieved with K_i = 1,500 and a final vocabulary size of K_f = 800. When the pLSA model is used to project the word–document space into the concept space (Fig. 9b), the best accuracy is achieved at K_i = 1,500 and K_f = 600.

To implement the proposed global method, we apply pyramid matching at different levels of resolution.


Fig. 9 Performance of proposed global method using LSA (a)/pLSA (b)

Fig. 10 Recognition accuracy using pLSA (b) and LSA (a) concept space in different levels of resolution

Fig. 11 Comparing the performance of proposed global method versus region-based method using LSA

The results of the global method at resolution levels 1 and 2 for the LSA and pLSA models are shown in Fig. 10.

The time complexity of implementing the region-based model with pLSA embedding is very high; therefore, we performed the experiments for this model using only LSA embedding. The experimental results for the region-based approach using LSA are shown in Fig. 11, with the initial vocabulary size fixed to K_i = 1,500. This figure shows the efficiency of the region-based method using LSA: the accuracy increases until K_f = 800, then slightly decreases.

To compare the global method with the region-based method using the LSA model, the results of both methods are illustrated in Fig. 11, with the initial vocabulary size fixed to K_i = 1,500. Although the region-based method outperforms the global method, their results are almost on par. However, since each spatial block must be projected separately into a new concept space in the region-based method, its time complexity is very high.

Table 2 Comparison with recently reported results for 15-Scene

Method                      Accuracy (%)
Our method                  84.82
Wu and Rehg [32]            83.88
Bosch et al. [2]            83.7
Lazebnik et al. [12]        81.4
Li et al. [13]              80.9
Saghafi et al. [28]         79.22
Parizi et al. [25]          78.6
Liu and Shah [16]           75.16
Liu et al. [18]             74.9
Oliva and Torralba [22]     74.10
Bosch et al. [1]            73.30
Fei-Fei and Perona [6]      65.2

Recognition rate of the proposed method is in bold

Therefore, the difference in time complexity between the global and region-based methods justifies using the global method.

Table 2 summarizes the recognition accuracy of our method and some notable related works. As seen from the table, the proposed method performs best. The work of Fei-Fei and Perona [6], which used LDA in a generative framework, has lower performance than the methods that use a semantic vocabulary while incorporating category labels, such as Liu (DM) [18] and Liu (MMI) [16]. Also, the works of Quelhas et al. [27] and, similarly, Bosch et al. [1] have lower accuracy than the works using co-clustering to obtain a semantic vocabulary, such as Liu (MMI) [17], Liu (DM) [18] and ours.


Fig. 12 Performance of our method using Spatial-LSA (a), Spatial-pLSA (b) concept space

Fig. 13 Performance of different spatial content capture methods

This is mainly because, in contrast to the former methods, which use a histogram of topics equal in size to the number of categories, the latter methods perform the clustering step in the semantic space to further group the semantically related words together and to construct more discriminative histograms with the actual number of topics.

6.4 Experimental evaluation on 67-MIT Indoor dataset

We have evaluated our global spatial method on the 67-MIT Indoor dataset in the LSA/pLSA concept space. The results of this evaluation are illustrated in Fig. 12a, b. As seen in these figures, the recognition accuracy obtained by applying the spatial-semantic vocabulary is remarkably higher than with the original semantic vocabulary in both the LSA and pLSA concept spaces (Fig. 12).

In Fig. 13, the recognition accuracy of the original SPM method is compared to spatial-pLSA and spatial-LSA. In this evaluation, the level of resolution is set to L = 3 for all three methods.

There are approaches that try to deal with scene recognition challenges by imposing high-level concepts [5,8,13,14,24–26,29]. In [13], the images are represented by a vocabulary of objects called the object bank; despite the high computational complexity, the method does not offer much increase in recognition accuracy. Li and Guo [14] improve scene recognition performance by capturing objects' co-occurrence and their geometric correlations. They build a three-level (superpixel, object, scene) hierarchical model, in which the operations at all levels are performed automatically.

In [1], the images are considered as a mixture of semantic topics. When followed by SVM classification, the classification rate is lower than that of our method. Oliva and Torralba [22] and Wu and Rehg [32] use the gist and CENTRIST global features, respectively, bypassing object-centered and local information. Although the recognition rates of these holistic methods were higher than those of purely local methods, they were still lower than our reported results. Capturing the spatial location of local patches in [12] and [32] significantly improved the recognition accuracy for scene recognition. In [24], Pandey and Lazebnik used the popular deformable part-based model (DPM) [23] for scene recognition and achieved an accuracy of 30.08 %; by combining the DPM results with color, gist and SIFT spatial pyramid information, they achieved an accuracy of 43.1 %. The discriminative mid-level patches in [29] achieved 38.1 % accuracy, and combining these mid-level patches with color, gist, DPM and SIFT spatial pyramids yielded 49.4 % recognition accuracy.

7 Conclusion

In this paper, we have proposed a novel approach for using semantic relations in the BoW framework. We have used latent aspect models, namely LSA and pLSA, to map the visual words into a semantic space. Under the LSA framework, this mapping is done by a low-rank decomposition of the word–document occurrence matrix using SVD. With pLSA, the topic-specific distributions of words are considered the projections of words onto different concepts. The distances in the proposed concept space reveal the semantic relations, and clustering is done in the concept space to capture the semantic structures. Our method also performs better compared to some similar methods for constructing semantic vocabularies.

In this paper, a spatial method is also proposed. The proposed spatial method incorporates the pyramid matching technique: with the semantic vocabulary, pyramid levels are constructed directly over the latent semantic space, using either one concept space for the entire scene or a separate projection of each image sub-scene into its own concept space.


Table 3 Comparison with recently reported results for 67-Indoor Scenes

Method                          Accuracy (%)
Our method                      37.68
Singh et al. [29]               38.1
Parizi et al. [25]              37.93
Li et al. [13]                  37.6
Wu and Rehg [32]                36.9
Lazebnik et al. [12]            34.4
Pandey and Lazebnik [24]        30.8
Quattoni and Torralba [26]      26.00
Quelhas et al. [27]             21.17
Oliva and Torralba [22]         22.0
Bosch et al. [1]                20

Recognition rate of the proposed method is in bold

The experimental evaluation shows remarkable improvements on both the 15-Scene and 67-Indoor Scenes datasets (Table 3).

References

1. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification via pLSA. In: European Conference on Computer Vision (ECCV) (2006)

2. Bosch, A., Zisserman, A., Muñoz, X.: Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Mach. Intell. 30(4), 712–727 (2008)

3. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: International Workshop on Statistical Learning in Computer Vision, ECCV (2004)

4. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)

5. Farahzadeh, E., Cham, T.J., Li, W.: Incorporating local and global information using a novel distance function for scene recognition. In: IEEE Workshop on Robot Vision, Winter Vision Meetings (WVM) (2013)

6. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2005)

7. Grauman, K., Darrell, T.: The pyramid match kernel: discriminative classification with sets of image features. In: IEEE International Conference on Computer Vision (ICCV) (2005)

8. Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2013)

9. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)

10. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2008)

11. Kwitt, R., Vasconcelos, N., Rasiwasia, N.: Scene recognition on the semantic manifold. In: European Conference on Computer Vision (ECCV) (2012)

12. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2006)

13. Li, L.J., Su, H., Xing, E.P., Fei-Fei, L.: Object bank: a high-level image representation for scene classification and semantic feature sparsification. In: Neural Information Processing Systems (NIPS) (2010)

14. Li, X., Guo, Y.: An object co-occurrence assisted hierarchical model for scene understanding. In: British Machine Vision Conference (BMVC) (2012)

15. Liu, D., Chen, T.: Unsupervised image categorization and object localization using topic models and correspondences between images. In: IEEE International Conference on Computer Vision (ICCV) (2007)

16. Liu, J., Shah, M.: Scene modeling using co-clustering. In: IEEE International Conference on Computer Vision (ICCV) (2007)

17. Liu, J., Shah, M.: Learning human actions via information maximization. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2008)

18. Liu, J., Yang, Y., Shah, M.: Learning semantic visual vocabularies using diffusion distance. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2009)

19. Lowe, D.: Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision (ICCV) (1999)

20. Moosmann, F., Triggs, B., Jurie, F.: Fast discriminative visual codebooks using randomized clustering forests. In: Neural Information Processing Systems (NIPS) (2006)

21. Niebles, J.C., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial–temporal words. Int. J. Comput. Vis. 79(3), 299–318 (2008)

22. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)

23. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)

24. Pandey, M., Lazebnik, S.: Scene recognition and weakly supervised object localization with deformable part-based models. In: IEEE International Conference on Computer Vision (ICCV) (2011)

25. Parizi, S., Oberlin, J., Felzenszwalb, P.: Reconfigurable models for scene recognition. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2012)

26. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2009)

27. Quelhas, P., Monay, F., Odobez, J.-M., Gatica-Perez, D., Tuytelaars, T., Van Gool, L.: Modeling scenes with local descriptors and latent aspects. In: IEEE International Conference on Computer Vision (ICCV) (2005)

28. Saghafi, B., Farahzadeh, E., Rajan, D., Sluzek, A.: Embedding visual words into concept space for action and scene recognition. In: British Machine Vision Conference (BMVC) (2010)

29. Singh, S., Gupta, A., Efros, A.A.: Unsupervised discovery of mid-level discriminative patches. In: European Conference on Computer Vision (ECCV) (2012)

30. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their location in images. In: IEEE International Conference on Computer Vision (ICCV) (2005)

31. Vogel, J., Schiele, B.: Natural scene retrieval based on a semantic modeling step. In: ACM International Conference on Image and Video Retrieval (CIVR) (2004)

32. Wu, J., Rehg, J.M.: CENTRIST: a visual descriptor for scene categorization. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1489–1501 (2011)
