
Fast Learning Discriminative Dictionaries for Large-scale Visual Recognition

Tianyi Zhao
Department of Computer Science
University of North Carolina at Charlotte, Charlotte, NC 28223

Yanyun Qu
Computer Science Department,
Xiamen University, Fujian, China

Jianping Fan
Department of Computer Science
University of North Carolina at Charlotte, Charlotte, NC 28223

Abstract—In this paper, we aim at improving discriminative joint dictionaries for large-scale image classification. Sparse representation is a popular tool for image classification, and the visual dictionary is critical to classification performance. A visual tree is constructed according to visual similarity, in which higher layers represent coarser memberships and lower layers represent finer memberships. A joint dictionary is learned according to the visual tree. A Bregman iterative algorithm is used to solve the optimization problem of joint dictionary learning, which makes the solution accurate and the running speed fast. Furthermore, we use pre-trained features learned from a convolutional neural network (CNN) to represent an image, and the residual error of the sparse representation is utilized for image classification. The experimental results demonstrate that the CNN feature is more discriminative than SIFT, and that the hierarchical classification framework with the Bregman iteration algorithm can greatly improve classification performance.

I. INTRODUCTION

Nowadays, image data are growing explosively. For example, the most popular social networks, such as Facebook, Instagram, and Flickr, add thousands of millions of images to the Internet every day. Thus, image classification attracts more and more attention. The most challenging task is large-scale image classification. Different from traditional image classification, large-scale image classification implies that both the number of classes and the number of images are huge. Traditional image classification usually handles from several to several hundred image classes, and a flat classification scheme is commonly used, that is, one-vs-one or one-vs-all classifiers. For large-scale image classification there are usually thousands of image classes, so one-vs-one or one-vs-all classifiers would be very time consuming and prohibitive in practice. Thus, in this paper a hierarchical classification framework is adopted. For a query image, not all classifiers are evaluated but only part of them, which greatly improves the running speed.

Recognition accuracy and computational efficiency are the two criteria used to evaluate the performance of methods. Recognition accuracy depends on the feature extraction and classification methods, while computational efficiency depends on the hierarchical organization.

Bag-of-words (BoW) technology is widely used in image classification. There are three critical factors: 1) local feature extraction, 2) visual dictionary construction and coding, and 3) classifier design. Local feature extraction is a prerequisite for image classification: the more discriminative the features are, the better the classification performance is. Most popular local features are well engineered, such as SIFT [5] and HOG [4], but such hand-crafted features can be inflexible and insufficiently discriminative. For the visual dictionary and encoding, K-means is the traditional method; because K-means is unsupervised, the resulting visual dictionary lacks discrimination. Thus, many researchers have paid more attention to encoding discriminative features. Yang et al. [18] use sparse coding to encode the features, and Zhou et al. [34] improved sparse coding with joint dictionary learning. However, the algorithms used in these dictionary learning methods are inefficient. For classifier design, most methods focus on binary classifiers, such as SVM, random forests, AdaBoost, and so on, and build multi-class classifiers on top of them. In this paper, both the classifier and the classification scheme are designed to suit the hierarchical classification framework, which is seldom discussed before.

In this paper, we propose a hierarchical classification method that addresses the three shortcomings of state-of-the-art BoW-based methods. For feature extraction, pre-trained features learned from a Convolutional Neural Network (CNN) are used to represent an image. Deep learning is a rising topic; motivated by its success in speech recognition and text classification, we use CNN features for hierarchical image classification. For dictionary learning, we use a joint dictionary learning method based on the Bregman iterative algorithm, which improves the computational speed of the sparse representation optimization problem. For the classifier, we use the residuals of the sparse representation of an image to decide its label.

The rest of this paper is organized as follows. In Section 2 we review the most relevant work. In Section 3 we introduce the JDL algorithm. In Section 4 we present the $\ell_1$-norm optimization algorithm for dictionary learning. The level-wise classification method is described in Section 5. The experimental setup and results are given in Section 6. We conclude in Section 7.

II. RELATED WORK

There are two types of hierarchical learning approaches: taxonomy-related methods and taxonomy-unrelated methods. Li et al. [1] constructed ImageNet according to WordNet [9]; since then, WordNet has often been used for large-scale image classification [10], [7], [13]. Visual information is also very important for image classification, and many works use visual cues to learn a hierarchical structure for image classification [20], [19]. Sivic et al. [14] used hierarchical latent Dirichlet allocation (hLDA) to discover a hierarchical structure from unlabeled images, which is helpful for image classification and image segmentation. Bart et al. [6] utilized an unsupervised Bayesian model to learn a tree structure for organizing image collections. Both of these methods were tested on moderate-scale datasets, and their performance on large-scale image datasets is unclear; moreover, they aim only at image classification without providing a visualization result. Some researchers [16] built hierarchical models based on the confusion matrix computed from the outputs of N one-vs-rest SVM classifiers for image categorization or object classification. Griffin et al. [32] constructed a binary branch tree to improve visual categorization. Bengio et al. [27] made a label embedding tree for multi-class classification. Liu et al. [8] built a probabilistic label tree for large-scale classification.

III. HIERARCHICAL JOINT DICTIONARY LEARNING

Dictionary learning usually trains a single dictionary for each dataset by minimizing the residual error. For large-scale image recognition, due to the strong inter-category relationships in the hierarchical structure of the visual tree, it is typically appropriate to learn dictionaries jointly. First we initialize the individual dictionary for each $i$-th category independently:

$$\min_{D_i, A_i} \; \|X_i - D_i A_i\|_F^2 + \lambda \|A_i\|_1 \qquad (1)$$

where $A_i$ is the sparse code of the original dataset $X_i$ [24]. Then we partition the dictionary $D_i$ into two parts: $D_i = [D_0, \hat{D}_i]$. The first part $D_0$, called the common dictionary, represents the common visual properties in the group. $\hat{D}_i$ is the class-specific dictionary for the $i$-th category. Similar to single dictionary learning, we minimize the residual error for $D_0$:

$$\min_{\{D_0, \hat{D}_i, A_i\}} \; \sum_i \left\{ \|X_i - [D_0, \hat{D}_i] A_i\|_F^2 + \lambda \|A_i\|_1 \right\} \qquad (2)$$

Both Equ.(1) and Equ.(2) can be solved effectively by most existing $\ell_1$-minimization algorithms.
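As a concrete illustration of the per-class initialization in Equ.(1), the sketch below uses scikit-learn's DictionaryLearning as an off-the-shelf $\ell_1$ solver; the paper itself uses the Bregman iteration of Section IV, and the atom count and $\lambda$ here follow the settings reported in Section VI-A.

```python
# Minimal sketch of the per-class initialization in Equ.(1), using
# scikit-learn's DictionaryLearning as a stand-in l1 solver (the paper
# itself uses the Bregman iteration of Section IV).
import numpy as np
from sklearn.decomposition import DictionaryLearning

def init_class_dictionary(X_i, n_atoms=256, lam=0.15, seed=0):
    """X_i: (n_samples, n_features) CNN features of one category.
    Returns D_i with atoms as columns and the sparse codes A_i."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=lam,
                            transform_algorithm='lasso_lars',
                            transform_alpha=lam, random_state=seed)
    A_i = dl.fit_transform(X_i)          # sparse codes, (n_samples, n_atoms)
    D_i = dl.components_.T               # dictionary, (n_features, n_atoms)
    return D_i, A_i
```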

Fig. 1. The hierarchical visual tree

IV. BREGMAN OPTIMIZATION METHOD FOR $\ell_1$-REGULARIZED PROBLEMS

The split Bregman method [31] was first used in image restoration [29], [15], and it has been remarkably successful at improving regularization quality. These state-of-the-art methods are based on the Bregman distance [3]. The Bregman distance associated with a function $J$ is defined in Equ.(3):

$$D_J^p(u, v) = J(u) - J(v) - \langle p, u - v \rangle, \quad p \in \partial J(v) \qquad (3)$$
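As a tiny numerical illustration of Equ.(3) for $J(u)=\|u\|_1$, the sketch below uses $p=\mathrm{sign}(v)$ as a subgradient of $J$ at $v$; the example values are arbitrary.

```python
# Tiny numerical illustration of the Bregman distance in Equ.(3) for
# J(u) = ||u||_1, using p = sign(v) as a subgradient of J at v.
import numpy as np

def bregman_distance_l1(u, v):
    p = np.sign(v)                       # a subgradient of ||.||_1 at v
    return np.abs(u).sum() - np.abs(v).sum() - p.dot(u - v)

u = np.array([-1.0, -2.0, 0.0])
v = np.array([0.5, -1.0, 0.3])
print(bregman_distance_l1(u, v))         # prints 2.0; it is 0 when u matches v's sign pattern
```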

The Bregman distance is not as intuitive as the Euclidean distance, but it can still measure the closeness between $u$ and $v$. The original form of the Bregman iteration is

$$u^{k+1} = \arg\min_u \; D_J^p(u, u^k) + \frac{1}{2}\|b - Au\|_F^2 \qquad (4)$$

$$p^{k+1} = p^k + A^T (b - Au^{k+1}) \qquad (5)$$

We initialize Equ.(5) with $p^0 = 0$, $u^0 = 0$, and take $J(u)$ to be the $\ell_1$ regularizer. The original $\ell_1$-regularized formula can be reduced to Equ.(4) with the input updated as in Equ.(6):

$$b^{k+1} = b + (b^k - Au^{k+1}), \quad b^0 = 0 \qquad (6)$$

The Bregman iterative algorithm is given in Algorithm 1. To solve step 3 of Algorithm 1 we use TwIST [12], which is fast because it relies on simple algebraic operations. The difference compared with applying TwIST directly to Equ.(1) is that we can solve the subproblem at a lower accuracy, yet at the end of the iteration the solution reaches a higher accuracy, close to machine precision. The overall procedure of discriminative dictionary learning is specified in Algorithm 2; we solve Equ.(2) by minimizing the single $\ell_1$-regularized problem class by class.

Algorithm 1 Bregman Iteration
1: Initialize: $b^1 = b$
2: repeat
3:   $u^{k+1} = \arg\min_u \; \mu \|u\|_1 + \frac{1}{2}\|b^k - Au\|_F^2$
4:   $b^{k+1} = b + (b^k - Au^{k+1})$
5: until convergence or after a fixed number of rounds

Fig. 2. The dictionary assignment scheme of the visual tree.
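A minimal sketch of Algorithm 1 is given below; the inner subproblem of step 3 is solved with plain ISTA as a stand-in for TwIST [12], and the step size and iteration counts are illustrative choices rather than the paper's settings.

```python
# Minimal sketch of Algorithm 1. The inner subproblem (step 3) is solved
# here with plain ISTA as a stand-in for TwIST [12].
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, mu, n_iter=100):
    """Approximately solve min_u mu*||u||_1 + 0.5*||b - A u||^2."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ u - b)
        u = soft_threshold(u - grad / L, mu / L)
    return u

def bregman_iteration(A, b, mu, n_outer=10, inner_iter=50):
    bk = b.copy()                        # step 1: b^1 = b
    u = np.zeros(A.shape[1])
    for _ in range(n_outer):             # steps 2-5
        u = ista(A, bk, mu, inner_iter)  # step 3 (low-accuracy inner solve)
        bk = b + (bk - A @ u)            # step 4: add back the residual
    return u
```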

Algorithm 2 Discriminative dictionary learning
1: repeat
2:   Initialize $\{D_i\}$ and $\{A_i\}$ independently
3:   For each class $i$, update $A_i$ by optimizing Equ.(1) with the Bregman iteration algorithm, keeping $D_i$ fixed
4:   For each class $i$, update $D_i$ by solving $\min_{D_i} \|X_i - D_i A_i\|_F^2$
5: until convergence or after a fixed number of rounds
6: repeat
7:   Initialize $\{D_0\}$ and set $D_i = [D_0, \hat{D}_i]$
8:   For each class $i$, update $A_i$ by optimizing $\min_{A_i} \|X_i - [D_0, \hat{D}_i]A_i\|_F^2 + \lambda\|A_i\|_1$ using Bregman iteration
9:   For each class $i$, update $\hat{D}_i$ by solving $\min_{\hat{D}_i} \|X_i - D_0 A_i^0 - \hat{D}_i \hat{A}_i\|_F^2$, where $A_i = [A_i^0; \hat{A}_i]$ is split according to $[D_0, \hat{D}_i]$
10:  Update $D_0$ by solving $\min_{D_0} \|X_0 - D_0 A_0\|_F^2$
11: until convergence or the number of iterations reaches the threshold
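For concreteness, the following is a compact sketch of Algorithm 2 under simplifying assumptions: the sparse codes are computed with scikit-learn's sparse_encode rather than the Bregman solver of Section IV, the dictionary updates are plain least squares with column normalization, and the sizes n_atoms and n_common as well as the interpretation of step 10 (refitting $D_0$ on the stacked class residuals) are our own illustrative choices.

```python
# Compact sketch of Algorithm 2 (two-phase joint dictionary learning).
import numpy as np
from sklearn.decomposition import sparse_encode

def normalize_cols(D, eps=1e-8):
    return D / (np.linalg.norm(D, axis=0, keepdims=True) + eps)

def code(X, D, lam):
    """Sparse codes A (atoms x samples) for X ~= D A."""
    return sparse_encode(X.T, D.T, algorithm='lasso_lars', alpha=lam).T

def fit_dict(X, A):
    """Least-squares update of D in X ~= D A."""
    M, *_ = np.linalg.lstsq(A.T, X.T, rcond=None)
    return normalize_cols(M.T)

def jdl(X_list, n_atoms=64, n_common=16, lam=0.15, n_iter=5, seed=0):
    rng = np.random.default_rng(seed)
    d = X_list[0].shape[0]
    # Phase 1 (steps 1-5): independent per-class dictionaries, Equ.(1)
    D_hat = [normalize_cols(rng.standard_normal((d, n_atoms))) for _ in X_list]
    for _ in range(n_iter):
        for i, X in enumerate(X_list):
            A = code(X, D_hat[i], lam)
            D_hat[i] = fit_dict(X, A)
    # Phase 2 (steps 6-11): common dictionary D0 plus class-specific parts, Equ.(2)
    D0 = normalize_cols(rng.standard_normal((d, n_common)))
    for _ in range(n_iter):
        A0s, residuals = [], []
        for i, X in enumerate(X_list):
            A = code(X, np.hstack([D0, D_hat[i]]), lam)       # step 8
            A0, Ai = A[:n_common], A[n_common:]
            D_hat[i] = fit_dict(X - D0 @ A0, Ai)              # step 9
            A0s.append(A0)
            residuals.append(X - D_hat[i] @ Ai)
        D0 = fit_dict(np.hstack(residuals), np.hstack(A0s))   # step 10
    return D0, D_hat
```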

V. LEVEL-WISE CLASSIFICATION SCHEME

In the hierarchical framework, a new image is classified from top to bottom. At each level, the image is assigned to the most likely node by the node classifier, until it reaches a leaf node.
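As an illustration of this top-down procedure, the sketch below walks a query feature from the root to a leaf; the TreeNode structure and the per-node predict interface are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the top-down, level-wise classification described above.
from dataclasses import dataclass, field
from typing import List, Optional, Any

@dataclass
class TreeNode:
    label: str
    children: List["TreeNode"] = field(default_factory=list)
    classifier: Optional[Any] = None      # e.g., an SVM over this node's children

    @property
    def is_leaf(self) -> bool:
        return not self.children

def classify_level_wise(root: TreeNode, feature) -> str:
    """Walk from the root to a leaf, letting each node's classifier pick a child."""
    node = root
    while not node.is_leaf:
        child_idx = int(node.classifier.predict([feature])[0])
        node = node.children[child_idx]
    return node.label
```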

A. Visual Tree

The label tree method [30] provides an efficient clustering strategy to build the visual tree. We assume that our large-scale image database contains $N$ categories, $C_1, C_2, \ldots, C_N$. In the $i$-th category, the CNN feature vectors of the images are represented as $\{x_{it}\}_{t=1}^{I_i}$, where $I_i$ is the number of images in the $i$-th category. The visual feature of the $i$-th category is represented by the average feature of all its images:

$$X_i = \frac{1}{I_i} \sum_{t=1}^{I_i} x_{it} \qquad (7)$$

To use the spectral clustering strategy, we first construct the affinity matrix $M \in \mathbb{R}^{N \times N}$, which contains the pair-wise similarity measures between every two categories. The similarity between the $i$-th and $j$-th categories is given in Equ.(8); it depends on the Euclidean distance between the two average features.

$$S(i, j) = \exp\left(-\frac{\|X_i - X_j\|^2}{\delta_{ij}^2}\right) \qquad (8)$$

Here $\delta_{ij}$ is the self-tuning parameter of [33]. The resulting visual tree is shown in Fig. 1. Based on the visual tree, we train an SVM classifier on each non-leaf node, using all CNN features as training data, to realize the level-wise classification.
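A hedged sketch of this grouping step is given below: it computes the class-mean features of Equ.(7), a self-tuning affinity in the spirit of Equ.(8) and [33], and clusters the categories with scikit-learn's SpectralClustering on the precomputed affinity. The local-scale choice (distance to the 7th-nearest class) and the helper names are assumptions.

```python
# Sketch of building the visual tree's groups: class-mean CNN features (Equ.(7)),
# a self-tuning affinity (Equ.(8)), and spectral clustering on that affinity.
import numpy as np
from sklearn.cluster import SpectralClustering

def class_means(features_per_class):
    """features_per_class: list of (n_images_i, d) arrays -> (N, d) means."""
    return np.stack([F.mean(axis=0) for F in features_per_class])

def self_tuning_affinity(X_mean, k=7):
    dists = np.linalg.norm(X_mean[:, None, :] - X_mean[None, :, :], axis=-1)
    # local scale sigma_i = distance to the k-th nearest class
    sigma = np.sort(dists, axis=1)[:, min(k, len(X_mean) - 1)]
    S = np.exp(-(dists ** 2) / (np.outer(sigma, sigma) + 1e-12))
    np.fill_diagonal(S, 1.0)
    return S

def group_categories(features_per_class, n_groups=83, seed=0):
    S = self_tuning_affinity(class_means(features_per_class))
    sc = SpectralClustering(n_clusters=n_groups, affinity='precomputed',
                            random_state=seed)
    return sc.fit_predict(S)              # group index for each of the N categories
```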

B. Hierarchical Classification Scheme

1) Hierarchical Dictionary Distribution Scheme: According to JDL, we obtain a common dictionary for each group. The hierarchical dictionary distribution scheme uses the common dictionaries of the groups to compose the dictionary of the higher-level node. For example, a visual tree of depth 3 is shown in Fig. 2. Each individual dictionary is trained from the CNN features of one category. The dictionary of a node in Level 2 is the concatenation of the individual dictionaries in its group, from which we obtain the common dictionary of that node. According to the hierarchical dictionary distribution scheme, the common dictionaries are then concatenated to form the dictionary of the root node.
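A small sketch of this distribution scheme, assuming dictionaries stored as numpy arrays with atoms as columns (our layout, not specified in the paper):

```python
# Small sketch of the dictionary distribution scheme of Fig. 2.
import numpy as np

def group_dictionary(individual_dicts):
    """Concatenate the per-category dictionaries of one group (Level 2)."""
    return np.hstack(individual_dicts)

def root_dictionary(common_dicts):
    """Concatenate the per-group common dictionaries D0 (root node)."""
    return np.hstack(common_dicts)

# Usage: with D[g][c] the dictionary of category c in group g, and D0[g] the
# common dictionary learned for group g by JDL:
# level2 = [group_dictionary(D[g]) for g in range(n_groups)]
# root = root_dictionary([D0[g] for g in range(n_groups)])
```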

2) Classification Strategy: The dictionary at each node is used in two ways. First, for the level-wise SVM classifier, we project the CNN features onto the dictionary and use the sparse representation vectors as the input to the SVM model. Second, the sparse representation based classifier (SRC) [25] can be used. SRC is widely used in image classification; it has achieved good results in [11], [22], and promising results have been reported in [23], [26]. As shown in Equ.(1), the sparse code captures the relationship between a category and the image. The sparse representation based classifier compares these relationships through the residual error of each sparse code, as in Equ.(9):

$$\mathrm{identity}(x) = \arg\min_i \|X_i - D_i A_i\|_F^2 \qquad (9)$$

where $X_i$ is the CNN feature, $D_i$ is the dictionary of the $i$-th group, and $A_i$ is the sparse code.
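A minimal sketch of the residual rule in Equ.(9) is given below; scikit-learn's sparse_encode is used as the $\ell_1$ solver, which is an illustrative choice rather than the solver used in the paper.

```python
# Minimal sketch of Equ.(9): sparse-code the query feature against each
# candidate dictionary and pick the smallest reconstruction residual.
import numpy as np
from sklearn.decomposition import sparse_encode

def src_classify(x, dictionaries, lam=0.15):
    """x: (d,) CNN feature; dictionaries: list of (d, k_i) arrays (atoms as columns)."""
    residuals = []
    for D in dictionaries:
        a = sparse_encode(x[None, :], D.T, algorithm='lasso_lars', alpha=lam)[0]
        residuals.append(np.linalg.norm(x - D @ a) ** 2)
    return int(np.argmin(residuals))      # index of the predicted group/class
```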

VI. EXPERIMENTS

We evaluate our method on two widely used datasets: the Oxford flower dataset and the ILSVRC2012 dataset. The Oxford flower dataset includes 17 classes with 80 images per class. The ILSVRC2012 dataset contains 1000 categories, each with more than 1000 images. First we use the Oxford flower dataset to verify the efficiency of the Bregman optimization method: we regard the whole Oxford flower dataset as one group and learn discriminative dictionaries containing both a common dictionary and a class-specific dictionary for each class. Then we further evaluate our method on the large-scale ImageNet ILSVRC2012 dataset.

A. The Evaluation On Oxford Flower Dataset

We first compare our approach with ScSPM [17], LLC [21], and JDL [34] on the Oxford flower dataset. We use SIFT descriptors extracted from 16x16 patches with a step size of 6 at 3 scales, and the $\ell_2$ norm of each SIFT descriptor is normalized to 1. The sparsity parameter $\lambda$ is set to 0.15. After sparse coding, the spatial pyramid feature [28] is computed by max pooling with a three-level spatial pyramid partition. These features are the inputs to the SVM classifier. We set the size of the individual dictionary to 256 for each class and the size of the common dictionary to 95. The results are shown in Table I. The single dictionary size for ScSPM with Bregman is 256x17.
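For reference, a sketch of the three-level spatial pyramid max pooling over per-descriptor sparse codes is given below; the 1x1/2x2/4x4 grid is the common SPM choice and, together with pooling absolute code values, is assumed here rather than taken from the paper.

```python
# Sketch of three-level spatial pyramid max pooling over per-descriptor sparse
# codes, as used to build the image feature before the SVM.
import numpy as np

def spm_max_pool(codes, xy, image_size, levels=(1, 2, 4)):
    """codes: (n_desc, k) sparse codes; xy: (n_desc, 2) descriptor centers;
    image_size: (width, height). Returns the concatenated pooled vector."""
    w, h = image_size
    pooled = []
    for g in levels:                      # grid of g x g spatial cells
        cx = np.minimum((xy[:, 0] * g / w).astype(int), g - 1)
        cy = np.minimum((xy[:, 1] * g / h).astype(int), g - 1)
        for i in range(g):
            for j in range(g):
                cell = np.abs(codes[(cx == i) & (cy == j)])
                pooled.append(cell.max(axis=0) if len(cell) else
                              np.zeros(codes.shape[1]))
    return np.concatenate(pooled)         # length k * (1 + 4 + 16)
```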

Fig. 3. Accuracy comparison on ILSVRC2012

B. The Evaluation On ILSVRC2012 Dataset

1) CNN Feature: The CNN model used in this paper is the pre-trained model of [2]. It contains 5 convolutional layers and two fully-connected layers. Two response-normalization layers follow the first and second convolutional layers, and max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU nonlinearity is used to model the neurons' outputs. In detail, the first convolutional layer filters the 224x224x3 input image with 96 kernels of size 11x11x3. The second convolutional layer filters the output of the first layer with 256 kernels of size 5x5x48. The third convolutional layer contains 384 kernels of size 3x3x256, the fourth convolutional layer 384 kernels of size 3x3x192, and the fifth convolutional layer 256 kernels of size 3x3x192. The fully-connected layers have 4096 neurons each. We use the output of the second fully-connected layer as the feature vector, so the feature of each image is a vector of size 4096.
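A hedged sketch of this feature extraction step is given below, using torchvision's pretrained AlexNet as a stand-in for the model of [2]; the preprocessing constants are the standard ImageNet statistics.

```python
# Sketch of extracting a 4096-d feature from the second fully-connected layer.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(pretrained=True).eval()

def cnn_feature(image_path: str) -> torch.Tensor:
    x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        x = model.features(x)
        x = model.avgpool(x)
        x = torch.flatten(x, 1)
        # classifier[:-1] stops before the final 1000-way layer, giving the
        # 4096-d output of the second fully-connected layer
        x = model.classifier[:-1](x)
    return x.squeeze(0)                   # shape: (4096,)
```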

2) Visual Tree: To evaluate the performance, a visual tree of depth 3 is used. The 1000 classes are divided into 83 groups by spectral clustering. The class 'sorrel' is a special case in that it is the only class in its group. The tree structure is shown in Fig. 1.

For clarity, we set the level of the leaf nodes to 1, and the level increases layer by layer. Here we use the 4096-dimensional CNN feature to obtain better performance. We learn the individual dictionaries of the 1000 classes, $[D_1, \ldots, D_{1000}]$, and then combine the individual dictionaries within each group into the dictionary of that group node, i.e., $\{[D_1, \ldots, D_{c_i}]\}_{i=1}^{83}$, where $c_i$ is the index of the last class in the $i$-th group. The common dictionaries $\{D_0^i\}_{i=1}^{82}$ learned in the groups are concatenated and used for the root classifier, as shown in Fig. 2.

3) Classification Scheme: We implement three classification methods on our 83-group structure. First, we use an SVM directly on the CNN features; every node has its own SVM, and for the root node we randomly pick 100 images per class to train the SVM model. This method is called SVM in Fig. 3 and Fig. 4. Second, compared with the first method, we use the sparse representation of the CNN feature as the input of the SVM model; this method is called DDL. Third, we implement the sparse representation based classification (SRC) method.

TABLE I
ACCURACY COMPARISON ON THE OXFORD FLOWER DATASET. METHOD NAMES WITH "*" INDICATE THAT THE BREGMAN ITERATIVE ALGORITHM IS USED.

Method        C1     C2     C3     C4     C5     C6     C7     C8     C9
ScSPM [17]    53.33  65.00  68.33  58.33  70.00  18.33  45.00  51.67  63.33
ScSPM w/B*    61.35  82.45  83.20  63.94  74.50  33.16  73.91  69.38  79.64
LLC w/B*      60.67  90.67  80.67  64.67  76.67  26.00  60.67  72.67  79.33
JDL           75.00  95.00  81.67  58.33  70.67  35.00  60.23  78.33  85.00
JDL w/B*      70.00  96.00  88.00  78.00  84.00  32.00  72.00  82.00  78.00

Method        C10    C11    C12    C13    C14    C15    C16    C17    Mean
ScSPM [17]    58.33  58.33  50.00  38.33  70.00  43.33  20.00  58.33  52.35
ScSPM w/B*    61.35  82.45  68.33  58.33  70.00  18.33  45.00  51.67  63.33
LLC w/B*      80.00  74.67  41.33  56.67  80.67  63.33  29.33  75.33  65.49
JDL           75.00  70.67  45.00  60.00  86.67  65.33  45.23  80.58  68.69
JDL w/B*      80.00  76.00  42.00  64.00  88.00  78.00  34.00  78.00  71.76

Fig. 4. Accuracy in each of the 83 groups.

The results of the three methods, compared with JDL, are shown in Fig. 3. Fig. 4 shows the results of the three methods in each of the 83 groups.

VII. CONCLUSIONS

In this paper, we combine state-of-the-art deep features with a joint dictionary learning method to obtain a better image feature representation. The Bregman iterative method is an efficient $\ell_1$-norm optimization method for joint dictionary learning. The hierarchical visual tree produced by the clustering strategy provides high correlation between the categories in the same group, which is the foundation for JDL. The hierarchical dictionary distribution scheme makes the large-scale image recognition problem affordable by using the common dictionaries to compose the dictionaries of higher levels. Support vector machines (SVM) and sparse representation based classification (SRC) are the two level-wise classifiers used in this paper. We conduct experiments on the most popular large-scale image dataset, ImageNet, and the experimental results demonstrate that our method is promising and superior to state-of-the-art hierarchical image classification methods.

VIII. ACKNOWLEDGMENT

The authors would like to thank the large-scale object recognition research group at Xiamen University. The authors would also like to thank the reviewers for their insightful comments and suggestions, which helped to make this paper more readable.

REFERENCES

[1] http://www.image-net.org/.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems, 2012.
[3] L. Bregman. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. Comput. Math. Math. Phys., 1967.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[5] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[6] E. Bart, I. Porteous, P. Perona, and M. Welling. Unsupervised discovery of visual taxonomies. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[7] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. Workshop on Statistical Learning in Computer Vision (ECCV), 2006.
[8] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. Advances in Neural Information Processing Systems, 2007.
[9] C. Fellbaum. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, 1998.
[10] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. Workshop on Statistical Learning in Computer Vision (ECCV), 2004.
[11] I. Ramirez, P. Sprechmann, and G. Sapiro. Classification and clustering via dictionary learning with structured incoherence and shared features. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[12] J. Bioucas-Dias and M. Figueiredo. A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Transactions on Image Processing, 2007.
[13] J. Fan, Y. Gao, H. Luo, and R. Jain. Mining multilevel image semantics via hierarchical classification. IEEE Transactions on Multimedia, 2008.
[14] J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised discovery of visual object class hierarchies. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[15] J. Zhang, D. Zhao, and W. Gao. Group-based sparse representation for image restoration. IEEE Transactions on Image Processing, 2014.
[16] J. Deng, S. Satheesh, A. C. Berg, and F.-F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In Proc. Advances in Neural Information Processing Systems, 2011.
[17] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[18] J. Yang, K. Yu, Y. Gong, and T. S. Huang. Linear spatial pyramid matching using sparse coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[19] J. Fan, Y. Shen, C. Yang, and N. Zhou. Structured max-margin learning for inter-related classifier training and multilabel image annotation. IEEE Transactions on Image Processing, 2011.
[20] J. Fan, X. He, N. Zhou, J. Peng, and R. Jain. Quantitative characterization of semantic gaps for learning complexity estimation and inference model selection. IEEE Transactions on Multimedia, 2012.
[21] J. Wang, J. Yang, K. Yu, F. Lv, and T. Huang. Locality-constrained linear coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[22] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2009.
[23] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[24] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computation, vol. 15, no. 2, 2003.
[25] M. Yang, L. Zhang, D. Zhang, and S. Wang. Relaxed collaborative representation for pattern classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[26] M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. IEEE International Conference on Computer Vision (ICCV), 2011.
[27] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 2008.
[28] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[29] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterated regularization method for total variation-based image restoration. Multiscale Model. Simul., 2005.
[30] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Proc. Advances in Neural Information Processing Systems, 2010.
[31] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman iterative algorithms for $\ell_1$-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 2008.
[32] Y. Wang and S. Gong. Refining image annotation using contextual relations between words. In ACM CIVR, 2007.
[33] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Proc. Advances in Neural Information Processing Systems, 2005.
[34] N. Zhou and J. Fan. Jointly learning visually correlated dictionaries for large-scale visual recognition applications. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.