Latent Pyramidal Regions for Recognizing Scenes
Fereshteh Sadeghi, Marshall Tappen
University of Central Florida
Orlando, Florida
fsadeghi,[email protected]
Abstract

In this paper we propose a simple but efficient image representation for solving the scene classification problem. Our new representation combines the benefits of the spatial pyramid representation using nonlinear feature coding with a latent Support Vector Machine (LSVM) to train a set of Latent Pyramidal Regions (LPR). Each of our LPRs captures a discriminative characteristic of the scenes and is trained by searching over all possible sub-windows of the images in a latent SVM training procedure. The responses of the LPRs form a single feature vector, which we call the LPR representation, that can be used for the classification task. We tested our model on three datasets with a variety of scene categories (15-Scenes, UIUC-Sports and MIT-Indoor) and obtained state-of-the-art results.
1. Introduction

In [6], we propose a new approach for representing images for image classification and particularly scene recognition. Our work is inspired by the success of latent variable approaches [4, 1] and the Spatial Pyramid (SP) representation [2, 7]. The spatial pyramid representation can capture the spatial aspects of images, but its ability to model images is limited by its fixed grid. In our model, a set of region detectors is learned. Each region is represented by a spatial pyramid and trained in a latent SVM framework to be flexible enough to capture the key characteristics of the scenes. Our model also has similarities with [4], where the deformable object detector of [1] is utilized to classify scenes without requiring human-segmented regions. The key differences between this work and [4] lie in how the models are constructed. Responding to the varied appearance of scenes, our model removes spatial constraints and focuses on finding characteristic image regions. Also, we separate the localization of key regions from the scene categorization. This allows the classifier to optimize the weights for distinguishing between classes without having to balance how the weight values affect which image regions are chosen.

1.1. The Latent Pyramidal Region Representation
We propose a new image representation designed for discriminating between image classes. In our new representation, each feature value expresses a particular type of scene region that is present in the images of one category. To make this representation robust to different spatial configurations, the position of each scene region is treated as a latent variable that is optimized as part of the representation. To capture the structure within a region, each region is represented with a spatial pyramid, which we refer to as a Latent Pyramidal Region; we call the resulting representation the Latent Pyramidal Regions (LPR) representation. The fundamental unit in the LPR is an image region detector that is parameterized to find image regions with a specific appearance. Given an input image $I$, the vector $\vec{v}$ will denote the LPR representation of $I$.
Each element in the vector $\vec{v}$ is computed by finding the maximum response of a cost function applied to different sub-windows in the image. If $v_i$ is the $i$th element of $\vec{v}$, we formally denote it as

$$v_i = \max_{w \in I} \; \theta_i^\top \vec{f}(I, w), \qquad (1)$$

where $\vec{f}(I, w)$ is a function that returns a vector of features extracted from sub-window $w$ in the image $I$. This max operation occurs over the set of all possible sub-windows in the image, and thus $w$ is the latent variable of our model. We represent these regions using the coding scheme proposed in [7]. The vector $\theta_i$ is a set of parameters that defines what type of image region each detector selects for; these parameters are trained discriminatively based on one-versus-all training of a structural latent SVM.
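As a concrete illustration, the max in Eq. (1) can be realized as an exhaustive scan over sub-windows. The sketch below is hypothetical: it assumes a precomputed dense feature map, and simple mean pooling stands in for the spatial-pyramid LLC coding of [7].

```python
import numpy as np

def lpr_response(feature_map, theta, window_shape):
    """Compute v_i = max_{w in I} theta_i^T f(I, w) by exhaustive search.

    feature_map:  assumed dense per-cell features, shape (H, W, D)
    theta:        parameter vector of one region detector, shape (D,)
    window_shape: (h, w) sub-window size in feature-map cells
    """
    H, W, D = feature_map.shape
    h, w = window_shape
    best = -np.inf
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            # stand-in for f(I, w): mean-pool the cells of the sub-window
            f = feature_map[y:y + h, x:x + w].reshape(-1, D).mean(axis=0)
            best = max(best, float(theta @ f))
    return best
```

In practice the search is restricted to a coarse grid of window positions and scales, since the number of all possible sub-windows grows quadratically with image size.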
The underlying idea behind the training process is to build a set of region detectors optimized for separating each scene category from the others. A detector defined by parameters $\theta_k$ is created by first choosing a particular scene category $k$. Each training image $I$ can then be assigned a label $y \in \{-1, +1\}$, with $y$ taking the value $+1$ if $I$ belongs to category $k$; otherwise, $y$ takes the value $-1$. The goal in training is to learn a prediction rule of the form

$$F_\theta(I) = \operatorname*{argmax}_{k, w} \left[ \theta_k^\top \vec{f}(I, w) \right], \qquad (2)$$

where $k$ will be the predicted label and $w$ will be the sub-window with the highest detection score. As in Eq. (1), the function $\vec{f}(I, w)$ evaluates to a vector of features extracted from sub-window $w$. The parameter vector $\theta$ is found by minimizing the cost function
$$f(\theta) = \frac{\lambda}{2} \|\theta\|^2 + \sum_{j=1}^{N} R_j(\theta), \qquad (3)$$
Method                   Accuracy
LLC (baseline)           80.57
LPR-MS (our approach)    83.29
LPR-LIN (our approach)   85.72
LPR-RBF (our approach)   85.81

Table 1. The average per-class accuracy results on the 15-Scenes dataset.
Method                   Accuracy
LLC (our global term)    81.87
LPR-MS (our approach)    85.0
LPR-LIN (our approach)   85.2
LPR-RBF (our approach)   86.25

Table 2. The average per-class accuracy results on the UIUC-Sports dataset.
with $\lambda$ balancing between the quadratic regularizer $\|\theta\|^2$ and the risk function $R_j(\theta)$, which is summed over the $N$ training images. The risk function $R_j(\theta)$ is structured to penalize the prediction function when it predicts an incorrect label (see [6] for more details).
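The exact form of $R_j(\theta)$ is given in [6]. As an illustrative stand-in only, a common choice in latent structural SVMs is the margin-rescaled structural hinge loss, which penalizes any wrong label whose best sub-window scores within a unit margin of the true label's best sub-window:

```python
import numpy as np

def structural_hinge_risk(class_scores, true_label):
    """Illustrative R_j(theta): margin-rescaled structural hinge loss.

    class_scores[k] holds max_w theta_k^T f(I_j, w), i.e. the latent
    max over sub-windows is assumed already folded into each score.
    """
    delta = np.ones_like(class_scores)  # 0/1 task loss as the margin
    delta[true_label] = 0.0
    # positive whenever some wrong class violates its margin
    return float(np.max(class_scores + delta) - class_scores[true_label])
```

This risk is zero when the true class beats every other class by the margin, and grows linearly with the size of the worst violation, which keeps the objective convex in $\theta$ for fixed latent windows.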
2. Experiments

The performance of the proposed method is evaluated on three scene datasets with diverse types of scenes (15-Scenes [2], UIUC-Sports [3], and MIT-Indoor [5]). We report three key results, in addition to results of previous work:

• LLC is the accuracy computed using a linear SVM combined with the spatial pyramid representation of the image using locality-constrained linear coding (LLC) [7].
• LPR-MS is the accuracy computed using the maximum responses of the region detectors associated with each class. If we denote $V_k$ as the set of all region detectors trained to respond to class $k$, the classification score is computed by summing the responses of those detectors. This is expressed as $y = \operatorname*{argmax}_{k \in K} \left[ \sum_{i \in V_k} v_i \right]$, where there are $K$ possible classes.
• LPR-RBF and LPR-LIN are the accuracies computed by an RBF-kernel SVM and a linear SVM using the LPR representation.
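The LPR-MS rule can be sketched directly. The helper below is hypothetical; it represents each $V_k$ as a list of detector indices per class:

```python
import numpy as np

def lpr_ms_predict(v, detectors_per_class):
    """LPR-MS: y = argmax_k sum_{i in V_k} v_i.

    v:                   LPR vector, one max response per region detector
    detectors_per_class: detectors_per_class[k] lists the indices V_k
    """
    class_scores = [sum(v[i] for i in V_k) for V_k in detectors_per_class]
    return int(np.argmax(class_scores))
```

Unlike LPR-LIN and LPR-RBF, this rule uses no second-stage classifier: each class is scored only by its own detectors.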
2.1. Key Results

We wish to highlight the following key results:

• While the LLC representation and the LPR representation use the exact same descriptors and coding scheme, the LPR representation outperforms LLC.
• The LPR representation outperforms other single-feature accuracy results as well as the deformable part model [4]. When other systems outperform LPR, they require the fusion of multiple features.
References

[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 32(9):1627–1645, 2010.
[2] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
Method                      Accuracy
DPM [4]                     30.4
DPM+KSPM+GIST-color [4]     43.1
LLC (our global term)       37.32
LPR-MS (our approach)       41.22
LPR-LIN (our approach)      44.84
LPR-RBF (our approach)      44.41

Table 3. The average per-class accuracy results on the MIT-Indoor dataset.
Figure 1. The detected regions found by LPR on the MIT-indoor, 15-Scenes, and UIUC-Sports datasets (example classes include movie theater, operating room, staircase, gym, living room, polo, sailing, bedroom, MIT mountain, and CAL suburb). The first five columns show the characteristic regions found by LPR. The last column is an example of inappropriate LPR region selection.
[3] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.
[4] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011.
[5] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.
[6] F. Sadeghi and M. F. Tappen. Latent pyramidal regions for recognizing scenes. In ECCV, pages 228–241, 2012.
[7] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.