
Proceedings of the 2012 International Conference on Wavelet Analysis and Pattern Recognition, Xian, 15-17 July, 2012

A SALIENT HIERARCHICAL MODEL FOR OBJECT RECOGNITION

WEI-BIN YANG, BIN FANG, ZHAO-WEI SHANG, BO LIN

School of Computer Science, Chongqing University, Chongqing, China
E-MAIL: [email protected]

Abstract:
Image saliency attempts to describe the most conspicuous part of an input image by mimicking the human visual selective attention mechanism. Naturally, it can be adopted for improving object recognition. To demonstrate the effectiveness of saliency in object recognition, this paper proposes a salient hierarchical model. First, the traditional saliency model is modified for more robust saliency estimation. Second, the visual saliency detection method is combined with the Hierarchical Maximization model to provide more useful visual information for classification. Experimental results show that the improved saliency model extracts more accurate conspicuity, and that the proposed salient hierarchical model outperforms the Hierarchical Maximization model.

Keywords:
Image saliency; visual cortex; hierarchical model; object recognition

1. Introduction

To learn how humans look and recognize is an important issue in computer vision and pattern recognition. Two closely related research topics are visual saliency detection and object recognition. Research on these two topics has mostly developed along separate lines. However, since both tasks are inspired by the human visual system and visual cortex, it is reasonable to believe that advances in one may benefit the other. Therefore, a robust saliency model and an effective combination may be the key to attention-based object recognition.

Visual saliency is believed to drive human fixation behavior during free viewing by attracting visual attention in a bottom-up way. Moreover, saliency also appears to determine which details humans find interesting in visual scenes [1]. The most influential computational framework for estimating visual saliency was proposed by Itti et al. [2], who implemented and further developed the physiologically inspired saliency-based model of visual attention introduced by Koch and Ullman [3]. Itti's saliency model first computes feature maps for color, intensity and orientation using a center-surround operator across different


scales, and then generates the saliency map by normalization and summation over these feature maps. Achanta et al. [4] used features of color and luminance to detect salient regions with well-defined boundaries. Goferman et al. [5] presented a saliency detector that computes the dissimilarity between different image patches over four scales. Cheng et al. [6] proposed a global method to detect visual saliency by measuring the dissimilarity between different image regions, which obtained excellent performance on salient object detection. Hou et al. [7] proposed an image descriptor, denoted the image signature, to approximate the foreground of an image using the Discrete Cosine Transform and the Inverse Discrete Cosine Transform.

In addition, based on our knowledge of the visual cortex, many studies focus on biologically plausible methods for object class recognition. Recent work by Serre et al. [8] proposed a computational model (Hierarchical Maximization, HMAX) based on the feedforward path of object recognition in cortex, which accounts for the first 100-200 milliseconds of processing in the ventral stream of the primate visual cortex [9]. The HMAX model obtains promising results on some of the standard classification datasets. Mutch et al. [10] improved the HMAX model by incorporating additional biologically motivated properties, such as sparsity and localized intermediate-level features.

To prove that visual saliency is useful for object recognition, Rutishauser et al. [11] applied Itti's saliency model with the SIFT descriptor. Han et al. [12] combined attention and recognition by replacing the first layer of the HMAX architecture with a saliency network. In this paper, we attempt to provide a new view from another direction: we use the saliency model to guide the learning process and to form the principle for choosing training samples in the HMAX model.

The rest of this paper is organized as follows. Section 2 introduces the proposed salient hierarchical model in detail. Section 3 evaluates the performance of the improved saliency model and the proposed salient hierarchical model. Conclusions are given in Section 4.


2. Salient hierarchical model

Since image saliency attempts to extract the most salient part of an image, we consider how it may be used to improve object recognition. Based on the basic observation that the salient part of an image carries features which are very important for image classification, we combine visual saliency with the HMAX model. We discuss image saliency detection and the HMAX model in Sections 2.1 and 2.2, respectively, and then describe the proposed salient model in detail in Section 2.3.

2.1. Image saliency detection

A given image I is first segmented into a set of regions R = \{r_1, r_2, \ldots, r_M\} by a graph-based image segmentation algorithm [13]. For each region r_i, the saliency value S_{r_i} is estimated as follows:

    S_{r_i} = f(r_i) \sum_{j=1}^{M} \omega(r_j) D_c(r_i, r_j) D_s(r_i, r_j)    (1)

where f(r_i) is the central bias, M is the total number of regions, \omega(r_j) is the weight of region r_j, D_c(r_i, r_j) is the color dissimilarity between regions r_i and r_j, and D_s(r_i, r_j) is the spatial distance between regions r_i and r_j. We set \omega(r_j) to be the number of pixels in region r_j to emphasize contrast against bigger regions, which follows the principle that humans pay more attention to big salient objects than to small ones [14].

We utilize a popular method [15, 6] to measure the color contrast and the spatial distance. The definition of the color contrast D_c(r_i, r_j) is as follows:

    D_c(r_i, r_j) = \sum_{k=1}^{N_i} \sum_{l=1}^{N_j} p_{c_{i,k}} \, p_{c_{j,l}} \, \| c_{i,k} - c_{j,l} \|    (2)

where N_i and N_j are the numbers of colors in regions r_i and r_j respectively, c_{i,k} is the k-th color in region r_i, c_{j,l} is the l-th color in region r_j, p_{c_{i,k}} is the probability of the color c_{i,k} among all N_i colors in region r_i, p_{c_{j,l}} is the probability of the color c_{j,l} among all N_j colors in region r_j, and \| c_{i,k} - c_{j,l} \| is the Euclidean distance between the colors c_{i,k} and c_{j,l} in the CIE Lab color space. To reduce the computational complexity, a quantization operation is necessary [6]. Each channel of the RGB color space is first quantized to 12 different values, and then the frequently occurring colors which cover more than 95% of the image pixels are chosen. After that, we transform the RGB color space into the Lab color space for the computation of Eq. 2.

The weighted spatial distance D_s(r_i, r_j) is defined as:

    D_s(r_i, r_j) = \exp( -\| C_i - C_j \| / \sigma )    (3)

where C_i and C_j are the centers of regions r_i and r_j respectively, \| C_i - C_j \| is the Euclidean distance between the two centers, and \sigma controls the strength of the spatial distance; it is set to 0.4 as in [15, 6].

The central bias [16] establishes that observers pay most attention to the center of the image, and it is extensively applied in visual saliency estimation [15, 17]. To model this location prior and emphasize the center of the visual field, we employ a two-dimensional anisotropic Gaussian function to describe how the location conducts the saliency:

    f(r_i) = \exp\left( -\left[ \frac{(x_c - x_0)^2}{2\sigma_x^2} + \frac{(y_c - y_0)^2}{2\sigma_y^2} \right] \right)    (4)

where (x_c, y_c) is the center of region r_i, (x_0, y_0) is the center of the visual field, and \sigma_x^2 and \sigma_y^2 are the variances along the two directions respectively. In all of our experiments, \sigma_x is set to 0.5 W_{im} and \sigma_y is set to 0.5 H_{im}, where W_{im} and H_{im} are the width and height of the image respectively.

Therefore, a saliency map can be obtained by Eq. 1. We also add some top-down information to provide more distinct features of the object; in this paper, we take face detection into account. In Section 2.3 we describe how image saliency improves visual recognition.
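To make the computation of Eqs. 1-4 concrete, a minimal Python sketch follows (our illustration, not the authors' implementation). It assumes the segmentation [13] is given as a per-pixel label map, and for brevity it replaces the quantized color histograms of Eq. 2 with each region's mean Lab color:

import numpy as np

def region_saliency(labels, lab_image, sigma=0.4):
    # labels: (h, w) integer region ids; lab_image: (h, w, 3) CIE Lab values.
    h, w = labels.shape
    region_ids = np.unique(labels)
    sizes, centers, colors = [], [], []
    for rid in region_ids:
        ys, xs = np.nonzero(labels == rid)
        sizes.append(len(xs))                           # omega(r_j): pixel count
        centers.append((xs.mean() / w, ys.mean() / h))  # normalized region center
        colors.append(lab_image[ys, xs].mean(axis=0))   # mean color (simplifies Eq. 2)
    sizes = np.asarray(sizes, dtype=float)
    centers = np.asarray(centers)
    colors = np.asarray(colors)
    # Central bias f(r_i), Eq. 4: in normalized coordinates the image center
    # is (0.5, 0.5) and sigma_x = sigma_y = 0.5.
    f = np.exp(-((centers[:, 0] - 0.5) ** 2 + (centers[:, 1] - 0.5) ** 2)
               / (2 * 0.5 ** 2))
    sal = np.zeros(len(region_ids))
    for i in range(len(region_ids)):
        d_c = np.linalg.norm(colors - colors[i], axis=1)                     # Eq. 2 (simplified)
        d_s = np.exp(-np.linalg.norm(centers - centers[i], axis=1) / sigma)  # Eq. 3
        sal[i] = f[i] * np.sum(sizes * d_c * d_s)                            # Eq. 1
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)  # map to [0, 1]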

2.2. HMAX model

The HMAX model [8, 9] attempts to summarize a core of well-accepted facts about the ventral stream in the visual cortex in a quantitative way. In its simplest form, the model


consists of four layers of computational units, S1, C1, S2 and C2, where simple S units alternate with complex C units.

S1 units take the form of Gabor functions, which have been shown to provide a good model of cortical simple-cell receptive fields [18] and are described as follows:

    F(x, y) = \exp\left( -\frac{x_0^2 + \gamma^2 y_0^2}{2\sigma^2} \right) \times \cos\left( \frac{2\pi}{\lambda} x_0 \right),  s.t.    (5)
    x_0 = x \cos\theta + y \sin\theta  and  y_0 = -x \sin\theta + y \cos\theta.

All filter parameters, i.e., the aspect ratio \gamma = 0.3, the orientation \theta, the effective width \sigma, the wavelength \lambda, as well as the filter sizes s, were adjusted so that the tuning properties match the bulk of V1 simple cells. Serre et al. [8] arranged the S1 filters to form a pyramid of scales, spanning a range of sizes from 7 x 7 to 37 x 37 pixels in steps of two pixels. To keep the number of units tractable, four orientations (0°, 45°, 90°, 135°) are considered, thus leading to 64 different S1 receptive field types in total (16 scales x 4 orientations).
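As an illustration of Eq. 5, the sketch below builds such an S1 Gabor bank. The aspect ratio gamma = 0.3, the 16 filter sizes and the four orientations come from the text; the per-size sigma and lambda values are rough placeholders, since the exact tuned values appear only in [8]:

import numpy as np

def gabor_filter(size, theta, sigma, lam, gamma=0.3):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x0 = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates of Eq. 5
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    g = (np.exp(-(x0 ** 2 + (gamma * y0) ** 2) / (2 * sigma ** 2))
         * np.cos(2 * np.pi * x0 / lam))
    g -= g.mean()                  # zero-mean, unit-norm filters as in [8]
    return g / np.linalg.norm(g)

sizes = range(7, 38, 2)            # 7x7 ... 37x37 in steps of two pixels
thetas = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
s1_bank = [gabor_filter(s, t, sigma=0.4 * s, lam=0.5 * s)  # placeholder tuning
           for s in sizes for t in thetas]                 # 16 x 4 = 64 types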

C1 units pool over afferent S1 units from the previous layer with the same orientation and from the same scale band. The scale band index of the S1 units also determines the size of the S1 neighborhood N_s x N_s over which the C1 units pool. This process is done for each of the four orientations and each scale band independently.
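A sketch of this pooling step follows, assuming the S1 response maps of the two scales of one band (same orientation) are available as 2-D arrays; N_s and the subsampling step are illustrative parameters (in [8] adjacent pooling grids overlap by N_s/2):

import numpy as np
from scipy.ndimage import maximum_filter

def c1_pool(s1_a, s1_b, ns, step):
    # Local max over an ns x ns neighborhood, then max across the two
    # scales of the band, then subsampling.
    pooled = np.maximum(maximum_filter(s1_a, size=ns),
                        maximum_filter(s1_b, size=ns))
    return pooled[::step, ::step]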

S2 units pool over afferent C1 units from a local spatial neighborhood across all four orientations. Each S2 unit response depends in a Gaussian-like way on the Euclidean distance between a new input and a stored prototype. That is, for an image patch X from the previous C1 layer at a particular scale, the response of the corresponding S2 unit is given by:

    response = \exp( -\beta \| X - P_i \|^2 )    (6)

where \beta defines the sharpness of the tuning and P_i is one of the N features. C2 units take a global maximum over all scales and positions for each S2 type over the entire S2 lattice. The result is a vector of N C2 values, where N corresponds to the number of prototypes extracted during the learning stage.
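The sketch below is a direct, brute-force rendering of Eq. 6 together with the C2 global maximum; the argument layout (a list of (H, W, 4) C1 maps, n x n x 4 prototypes) is our assumption for illustration, not the authors' code:

import numpy as np

def s2_c2(c1_maps, prototypes, beta=1.0):
    c2 = np.full(len(prototypes), -np.inf)
    for p_idx, proto in enumerate(prototypes):
        n = proto.shape[0]
        for c1 in c1_maps:                     # maximum over all scales ...
            H, W, _ = c1.shape
            for y in range(H - n + 1):         # ... and all positions
                for x in range(W - n + 1):
                    patch = c1[y:y + n, x:x + n, :]
                    resp = np.exp(-beta * np.sum((patch - proto) ** 2))  # Eq. 6
                    c2[p_idx] = max(c2[p_idx], resp)
    return c2   # the C2 feature vector: one value per prototype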

The learning process corresponds to selecting a set of N prototypes P_i for the S2 units. They are extracted randomly from a large pool of prototypes of various sizes at the level of the C1 layer across all four orientations.

2.3. Salient hierarchical model

Since visual saliency and the HMAX model are both inspired by the human visual system, we naturally consider that they could be combined in an effective way to improve object recognition. As described in Section 2.1, we estimate the image saliency by combining central bias, color dissimilarity and spatial dissimilarity; we denote the saliency map of the input image by S. Similar to the HMAX model, the proposed salient hierarchical model is composed of two key steps, the learning stage and the classification stage. To prove the effectiveness of visual saliency, we utilize the saliency map to generate the sample patches which are randomly selected from C1 units in the learning stage.

The whole procedure of extracting C1 prototypes for the learning stage is presented in Algorithm 1. First, for a randomly selected C1 unit, we binarize the corresponding saliency map S by a threshold v. Then S is rescaled to the size of the selected C1 unit. We determine where the C1 prototype should be extracted in the C1 unit by the conspicuous part of the saliency map S. To avoid the influence of an inadequate conspicuous part, we apply a dilation operation to the map S only if the ratio r satisfies the condition w < r < 2w, where w is an experimental threshold and r is defined as

    r = \frac{\sum_i S_i}{H \times W}    (7)

where H and W are the height and width of S, respectively, and S_i is the value (0 or 1) at the i-th location in S.

Algorithm 1. Extracting C1 prototypes for the learning stage
Input: nSize, nNumber, v, w, t
Output: nSize x nNumber C1 prototypes
 1: for i = 0; i < nSize; i++ do
 2:   for j = 0; j < nNumber; j++ do
 3:     Obtain the corresponding saliency map S by Eq. 1;
 4:     Binarize S by the threshold v, and resize S to a randomly selected C1 unit;
 5:     if r (Eq. 7) < w then
 6:       goto 13;
 7:     else if r < 2w then
 8:       Dilate S with a 3 x 3 square operator;
 9:     end if
10:     if j < t x nNumber then
11:       Randomly extract a C1 prototype in the corresponding white (salient) region of S;
12:     else
13:       Randomly extract a C1 prototype in the whole unit;
14:     end if
15:   end for
16: end for
17: return nSize x nNumber C1 prototypes.

Finally, we extract C1 prototypes by the following principle:

    randomly extract in the salient region of S,   if j < t x nNumber
    randomly extract in the whole C1 unit,         otherwise    (8)

where j is the current number of extracted C1 prototypes and t is a probability that determines whether the C1 prototype is extracted in the salient region of S. If t = 0, the model falls back to the HMAX model; if t = 1, all C1 prototypes may be limited to the salient region, which may produce similar sample patches whenever the saliency map fails to extract the object part. Therefore, we set t experimentally, and we discuss this choice in Section 3.
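The sketch below renders the extraction step of Algorithm 1 and Eqs. 7-8 for one prototype, with the paper's v = 0.3, w = 0.15 and t = 0.6 as defaults. The rescaling of S to the C1 unit is omitted (the two are assumed to have the same size), and treating "extract in the salient region" as requiring the sampled patch to overlap a white pixel is our reading of the algorithm:

import numpy as np
from scipy.ndimage import grey_dilation

def extract_prototype(c1_unit, saliency, j, n_number,
                      n=4, v=0.3, w=0.15, t=0.6, rng=np.random):
    H, W = c1_unit.shape[:2]
    s = (saliency > v).astype(float)      # binarize S by threshold v
    r = s.sum() / (H * W)                 # Eq. 7: ratio of salient area
    use_salient = r >= w and j < t * n_number   # Alg. 1, lines 5-6 and 10-13
    if w < r < 2 * w:
        s = grey_dilation(s, size=(3, 3)) # enlarge an inadequate salient part
    while True:
        y = rng.randint(0, H - n + 1)
        x = rng.randint(0, W - n + 1)
        if not use_salient or s[y:y + n, x:x + n].any():
            return c1_unit[y:y + n, x:x + n]   # the n x n C1 prototype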

3. Experiments and discussion

3.1. Image datasets

To evaluate the improved saliency model, we consider the public dataset in [4], denoted MSRA, which contains 1000 color images with accurate pixel-wise object-contour segmentations, selected from a 5000-image dataset [19]. Furthermore, to demonstrate the performance of the proposed salient hierarchical model, we consider 10 image datasets, denoted CalTech (available at http://www.robots.ox.ac.uk/~vgg/data/data-cats.html): leaves (187 images), cars_brad (1155 images), faces (452 images), airplanes (1076 images), motorbikes (828 images), bottles (247 images), camels (356 images), guitars (1031 images), houses (1001 images) and cars_markus (127 images). Some sample images are shown in Fig. 1.

3.2. Results

To validate the improved saliency model, we test our algorithm on the MSRA dataset. We compare our model with four state-of-the-art methods, denoted FT [4], CA [5], RC [6] and IS [7], respectively. The parameter settings are specified in Section 2.1. The results are shown in Fig. 2. Our method generates saliency maps that are closer to the ground truth and extracts more of the conspicuous part than the other saliency models, so it is reasonable to expect that our saliency model extracts more salient features for improving object recognition.

Figure 1. Sample images from MSRA and CalTech.

For fair comparison, the proposed salient hierarchical model keeps the same parameter settings as the HMAX model [8]. In the learning stage, we select 1000 C1 prototypes P_i for the S2 units. That is, each P_i contains n x n x 4 C1 elements, where n = 4, 8, 12, 16, so nSize = 4. We randomly extract nNumber = 250 patches for each size in the C1 units, and hence nSize x nNumber = 4 x 250 = 1000. Furthermore, for simplicity, we extract these C1 prototypes only in the band with filter sizes 11 x 11 and 13 x 13. The thresholds v and w are set to 0.3 and 0.15, respectively. To choose a proper threshold t in Eq. 8, we measure our method over 4 independent runs while varying t. By evaluating the mean and standard deviation of the accuracy, as shown in Fig. 3, we set t = 0.6, since in that case the average accuracy is the highest and the standard deviation is the smallest, which indicates a high and stable performance.

To compare the performance of the HMAX model and the proposed salient hierarchical model (SHM), we carry out the following experiments on the 10 CalTech datasets. All images


are rescaled to 140 pixels along the longer dimension (between height and width), preserving the image aspect ratio. A linear binary SVM classifier is applied to classify the 1000 C2 features generated by HMAX and SHM. We use 30 negative training examples and 30 positive training examples. The 30 negative examples are constant in all experiments; they are randomly selected from the background collection in the Caltech101 dataset [20]. For testing, we use all available images in each dataset, excluding the training examples. For each dataset, we report the mean and standard deviation of the performance over 8 independent runs. As shown in Table 1, SHM outperforms HMAX distinctly, not only with a higher average accuracy but also with smaller (more stable) standard deviations.

Figure 2. Comparison with other saliency models (FT [4], CA [5], RC [6], IS [7]) on the MSRA dataset.

TABLE 1. Performance comparison with the HMAX model on 10 CalTech datasets. Accuracy is measured by the mean and standard deviations of 8 independent runs using all available test images.

Datasets       HMAX             SHM
leaves         0.9128±0.0274    0.9258±0.0229
cars_brad      0.9569±0.0228    0.9684±0.0074
faces          0.9217±0.0285    0.9361±0.0175
airplanes      0.9280±0.0266    0.9426±0.0208
motorbikes     0.9178±0.0388    0.9221±0.0170
bottles        0.6521±0.0984    0.7201±0.0302
camels         0.6813±0.0703    0.7181±0.0276
guitars        0.7714±0.0508    0.8179±0.0261
houses         0.8747±0.0423    0.9001±0.0273
cars_markus    0.9115±0.0403    0.9227±0.0306
average        0.8471±0.0446    0.8777±0.0227

Since HMAX and SHM share similar experimental settings and the only difference is that SHM incorporates the result obtained by our saliency model, it is reasonable to believe that visual saliency improves object recognition and that our proposed model demonstrates an effective combination method.

Figure 3. Performance on the airplanes dataset when varying the threshold t. Results are the mean and standard deviations of 4 independent runs using all available test images.
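As an illustration of the protocol above, the sketch below runs one random training/testing split with a linear binary SVM on precomputed C2 features; scikit-learn is our assumption, since the paper does not name its SVM implementation:

import numpy as np
from sklearn.svm import LinearSVC

def run_split(c2_pos, c2_neg, rng):
    # c2_pos: (num_images, 1000) C2 features of one category;
    # c2_neg: C2 features of the background set; the first 30 negatives
    # are held fixed across all experiments, as in the text.
    tr_p = rng.choice(len(c2_pos), 30, replace=False)
    X_train = np.vstack([c2_pos[tr_p], c2_neg[:30]])
    y_train = np.array([1] * 30 + [0] * 30)
    clf = LinearSVC().fit(X_train, y_train)
    te_p = np.setdiff1d(np.arange(len(c2_pos)), tr_p)  # all remaining images
    X_test = np.vstack([c2_pos[te_p], c2_neg[30:]])
    y_test = np.array([1] * len(te_p) + [0] * (len(c2_neg) - 30))
    return clf.score(X_test, y_test)

Repeating such a split 8 times with different random positive training sets, e.g. run_split(c2_pos, c2_neg, np.random.default_rng(k)) for k = 0..7, yields the means and standard deviations of the kind reported in Table 1.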

3.3. Discussion

We have improved the saliency model in [6] and proposed a salient hierarchical model based on the HMAX model for object recognition. Both visual saliency and the HMAX model attempt to find out how humans look by mimicking the human visual system, so we combine the two different tasks in an effective way. The experimental results show that our salient hierarchical model outperforms the HMAX model under similar parameter settings. Some parameters simply follow the settings in [8], while the parameters specific to our model, such as the threshold t in Eq. 8, are determined experimentally. In sum, the proposed salient hierarchical model may not be the best method for image classification, but it provides a new view on visual saliency based object recognition.

4. Conclusions

To learn how people look and recognize is an important long-term research topic in computer vision. In this sense, this paper makes two main contributions. First, we improve the saliency model in [6]. In particular, we utilize the central bias theory and add top-down information (i.e., face detection) to provide more robust saliency detection. Second, we propose a salient hierarchical model, which merges the saliency map into the HMAX model for more accurate object recognition. The proposed model is a simple application of visual saliency in object recognition, and the experimental results demonstrate that the proposed model is robust. We can conclude that visual saliency is helpful for object recognition, in line with similar work in [11, 12], and that the proposed combination method is effective. Future work may focus on finding a more robust combination method and on explaining, physiologically, psychologically and mathematically, the essential issue of how visual attention guides more accurate object recognition.

Acknowledgements

This work is supported by the Fundamental Research Funds for the Central Universities (CDJXS11182240 and CDJXS10182216), and the National Natural Science Foundation of China (61173129, 61173130, 61103116).

References

[1] A. Toet, "Computational versus psychophysical bottom-up image saliency: A comparative evaluation study," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2131-2146, 2011.

[2] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.

[3] C. Koch and S. Ullman, "Shifts in selective visual attention: toward the underlying neural circuitry," Human Neurobiology, vol. 4, no. 4, pp. 219-227, 1985.

[4] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597-1604.

[5] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2376-2383.

[6] M.M. Cheng, G.X. Zhang, N.J. Mitra, X. Huang, and S.M. Hu, "Global contrast based salient region detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 409-416.

[7] X. Hou, J. Harel, and C. Koch, "Image signature: Highlighting sparse salient regions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 194-201, 2012.

[8] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411-426, 2007.

[9] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nature Neuroscience, vol. 2, no. 11, pp. 1019-1025, 1999.

[10] J. Mutch and D.G. Lowe, "Object class recognition and localization using sparse features with limited receptive fields," International Journal of Computer Vision, vol. 80, no. 1, pp. 45-57, 2008.

[11] U. Rutishauser, D. Walther, C. Koch, and P. Perona, "Is bottom-up attention useful for object recognition?," in IEEE Conference on Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 37-44.

[12] S. Han and N. Vasconcelos, "Biologically plausible saliency mechanisms improve feedforward object recognition," Vision Research, vol. 50, no. 22, pp. 2295-2307, 2010.

[13] P.F. Felzenszwalb and D.P. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, vol. 59, no. 2, pp. 167-181, 2004.

[14] Y. Lin, B. Fang, and Y. Tang, "A computational model for saliency maps by using local entropy," in AAAI Conference on Artificial Intelligence, 2010, pp. 967-973.

[15] L. Duan, C. Wu, J. Miao, and A. Bovik, "Visual conspicuity index: spatial dissimilarity, distance and central bias," IEEE Signal Processing Letters, vol. 18, no. 11, pp. 690-693, 2011.

[16] B.W. Tatler, "The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions," Journal of Vision, vol. 7, no. 14, pp. 1-17, 2007.

[17] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in International Conference on Computer Vision, 2009, pp. 2106-2113.

[18] J.P. Jones and L.A. Palmer, "An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex," Journal of Neurophysiology, vol. 58, no. 6, pp. 1233-1258, 1987.

[19] T. Liu, J. Sun, N. Zheng, X. Tang, and H.Y. Shum, "Learning to detect a salient object," in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.

[20] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59-70, 2007.
