


Image and Vision Computing 32 (2014) 189–201

Contents lists available at ScienceDirect

Image and Vision Computing

journal homepage: www.elsevier.com/locate/imavis

Editor’s Choice Article

Sparse feature selection based on graph Laplacian for web image annotation☆

Caijuan Shi a,b,c,1, Qiuqi Ruan a,c, Gaoyun An a,c

a Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
b College of Information Engineering, Hebei United University, Tangshan 063009, China
c Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China

☆ Editor's Choice Articles are invited and handled by a select rotating 12 member Editorial Board committee. This paper has been recommended for acceptance by Sinisa Todorovic.

E-mail addresses: [email protected] (C. Shi), [email protected] (Q. Ruan), [email protected] (G. An).

1 Tel.: +86 15630555090.
2 FSLG: sparse Feature Selection based on Graph Laplacian.

0262-8856/$ – see front matter © 2014 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.imavis.2013.12.013

Abstract

Article info

Article history:
Received 31 July 2013
Received in revised form 6 December 2013
Accepted 30 December 2013

Keywords:
Web image annotation
Sparse feature selection
l2,1/2-matrix norm
Semi-supervised learning
Graph Laplacian

Confronted with the explosive growth of web images, web image annotation has become a critical research issue for image search and indexing. Sparse feature selection plays an important role in improving the efficiency and performance of web image annotation. Meanwhile, it is beneficial to develop an effective mechanism to leverage the unlabeled training data for large-scale web image annotation. In this paper we propose a novel sparse feature selection framework for web image annotation, namely sparse Feature Selection based on Graph Laplacian (FSLG)2. FSLG applies the l2,1/2-matrix norm in its sparse feature selection algorithm to select the most sparse and discriminative features. Additionally, graph Laplacian based semi-supervised learning is used to exploit both labeled and unlabeled data to enhance annotation performance. An efficient iterative algorithm is designed to optimize the objective function. Extensive experiments on two web image datasets are performed, and the results illustrate that our method is promising for large-scale web image annotation.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Recent years have witnessed the continuously explosive growth of web images, and how to annotate these countless web images effectively has become a critical research issue. Usually, images are represented by different types of features, so feature selection is widely used to improve web image annotation performance. Faced with large-scale web images, traditional feature selection algorithms such as Fisher Score [1] and ReliefF [2] become less effective and efficient because they select features one by one. Recently, sparse feature selection has received a lot of attention in web image annotation [3–6] for achieving better performance. In this paper, we exploit the latest mathematical advances in sparse feature selection to boost web image annotation performance.

Recently, sparse feature selection has developed rapidly. The most well-known sparse model is the l1-norm (lasso) [7], and Cai et al. have used an l1-regularized regression model to select features jointly [8]. In spite of their computational convenience, sparse feature selection methods based on the l1-norm model sometimes select features that are insufficiently sparse. Many works [9–12] have extended the l1-norm to the lp-norm (0 < p < 1). In [13] and [14], Xu et al. have shown that when p is 1/2, the lp-norm, i.e., l1/2-norm regularization, has the best performance. However, all of the above methods neglect the useful information in the correlation between different features. In [15], Nie et al. have introduced a joint l2,1-norm minimization on both the loss function and the regularization for feature selection, which realizes sparse feature selection that accounts for the correlation between features. In [16], Ma et al. have proposed the l2,p-norm (0 < p < 2) and applied it to multimedia event detection to remove shared irrelevance and noise, obtaining a more discriminative event detector; the lower p is, the more correlated the concept classifier and the event detector are. Because the lp-norm (0 < p < 1) can select sparser features than the l1-norm, Wang et al. [17] recently have proposed to extend the l2,1-norm to the l2,p-matrix norm (0 < p ≤ 1) so as to select joint, sparser features; at the same time this method has better robustness than the l2,1-norm. In this paper we propose a novel sparse feature selection method based on the l2,1/2-matrix norm and apply it to web image annotation.

It is very time-consuming and labor-intensive to label a large number of training data manually, so how to utilize a small number of labeled data to annotate large-scale web images becomes a challenging problem. In [18], Yang et al. have proposed to exploit the information shared by multiple related tasks for multimedia content analysis to overcome the limitations of a small number of labeled data. This work is based on the assumption that the multiple tasks are correlated. However, automatic evaluation of the correlation among multiple tasks is an open problem. Because unlabeled data are easy to acquire


Fig. A. Annotation results of five images from each dataset. The top line shows the annotation results using 10% labeled training data and the second line shows the results using 25% labeled training data for each dataset. Wrong keywords are shown in red.


and are capable of improving annotation performance, many semi-supervised learning approaches that exploit a small number of labeled data and a large number of unlabeled data have been proposed [19–21]. On the one hand, semi-supervised learning saves the human labor cost of labeling a large amount of images; on the other hand, it makes full use of the unlabeled data to improve performance. In this paper we apply semi-supervised learning to large-scale web image annotation by exploiting a small number of labeled data and a large number of unlabeled data.

In [22], Zhu has reviewed different approaches to semi-supervised learning. Graph Laplacian based semi-supervised learning has attracted the most research interest, and its key technique is graph construction. Jebara et al. [23] have proposed to construct the graph using b-matching, but this method has high computational complexity and is not suitable for large-scale web images. Because of its simplicity, KNN graph Laplacian based semi-supervised learning is widely used, and many related methods have been proposed for web image annotation. In [6], Ma et al. have proposed a semi-supervised framework built upon feature selection for automatic image annotation. In [19], Yang et al. have proposed a new framework for web image annotation by integrating shared structure learning and graph-based learning into a joint framework. In [20], Tang et al. have proposed a novel kNN-sparse graph-based semi-supervised learning approach to harness the labeled and unlabeled data simultaneously for noisily-tagged web image

annotation. In [24], Wang et al. have used the sample reconstruction method to construct a graph for web image annotation. In this paper, we apply KNN graph Laplacian based semi-supervised learning in our sparse feature selection framework to build our model.

In this paper, we propose a new sparse feature selection framework based on the l2,1/2-matrix norm and KNN graph Laplacian based semi-supervised learning for web image annotation. We call it sparse Feature Selection based on Graph Laplacian (FSLG). By using the l2,1/2-matrix norm model, FSLG can select sparser and more discriminative features. Additionally, our method utilizes graph Laplacian based semi-supervised learning to exploit a small number of labeled data and a large number of unlabeled data for achieving better web image annotation performance. Extensive experiments are performed on the NUS-WIDE [25] dataset and the MSRA-MM 2.0 [26] dataset, and the results demonstrate that our algorithm outperforms state-of-the-art feature selection algorithms for web image annotation.

The main contributions of our work are as follows:

• The sparse Feature Selection based on Graph Laplacian framework is proposed, which utilizes the l2,1/2-matrix norm to select the most sparse and discriminative features and uses graph Laplacian based semi-supervised learning to simultaneously exploit a small number of labeled data and a large number of unlabeled data to annotate large-scale web images;



• A fast iterative algorithm for optimizing the objective function is proposed, and the convergence of the algorithm is proven;

• Several experiments are conducted on two web image datasets, and the results demonstrate the effectiveness of our method.

The rest of this paper is organized as follows. In Section 2, we briefly review related works, including sparse feature selection, graph based semi-supervised learning, and web image annotation. In Section 3, we describe the details of the proposed model FSLG, followed by the proposed solution. In Section 4, we conduct extensive experiments to demonstrate the promising performance of our method for web image annotation. The conclusion is given in Section 5.

2. Related work

In this section, we briefly discuss three topics related to our method: sparse feature selection, graph Laplacian semi-supervised learning, and web image annotation.

Fig. B. Performance variance w.r.t. the percentage of labeled training data. Our method outperforms all other algorithms on the two datasets. Fig. B.1 NUS-WIDE dataset. Fig. B.2 MSRA-MM dataset.

2.1. Sparse feature selection

Sparse feature selection aims to use a variety of sparse models to select the most discriminative features and achieve a sparse representation of the data. Sparse feature selection can not only boost image annotation performance, but also reduce the computational cost.

The most well-known sparse model is the l1-norm, which is widely used to reformulate sparse feature selection as a convex problem, but it sometimes cannot select sufficiently sparse features. Recently, in order to obtain sparser representations, many works have extended the l1-norm model to the lp-norm (0 < p < 1) model. In [13,14], Xu et al. have shown that l1/2-norm regularization yields the most sparsity and the best performance for p in (0, 1]. However, the above methods neglect the correlation between features. In [15], Nie et al. have proposed the robust l2,1-norm, which achieves sparse feature selection across all data points. In order to conduct more effective feature selection across all data points, Wang et al. [17] have proposed the l2,p-matrix norm and pointed out that when p is 1/2, l2,1/2-matrix norm regularization obtains the best performance and sparsity across all data points.

Let G ∈ R^{d×c} be the projection matrix used for sparse feature selection. A general sparse feature selection framework obtains G by minimizing the following regularized empirical error:

\min_G \; \mathrm{loss}(G) + \lambda R(G) \quad (1)

where loss(·) is the loss function and R(G) is the regularization term with λ as its regularization parameter.

The l2,p-matrix norm of G is defined as:

\|G\|_{2,p} = \Big( \sum_{i=1}^{d} \|g^i\|_2^p \Big)^{1/p}, \quad p \in (0, 1] \quad (2)

Obviously, the l2,p-matrix norm reduces to the l2,1-norm when p = 1. When p belongs to (0, 1), the l2,p-matrix norm becomes a better sparse model than the l2,1-norm because the lp-norm can find sparser solutions than the l1-norm. In addition, for any p in (0, 1), the noise magnitude of a distant outlier under the l2,p-matrix norm is no more than that under the l2,1-norm, so the l2,p-matrix norm is expected to select more robust features [17]. Note that the l2,p-matrix norm (0 < p < 1) does not satisfy the triangle inequality and is not convex. In [17], an efficient approach to solve the l2,p-matrix norm problem (0 < p ≤ 1) has been proposed.
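As an illustration (not from the paper), Eq. (2) can be computed in a few lines of NumPy; the function name `l2p_matrix_norm` and the toy matrix are hypothetical:

```python
import numpy as np

def l2p_matrix_norm(G, p):
    """Compute the l_{2,p}-matrix norm of Eq. (2): (sum_i ||g^i||_2^p)^(1/p)."""
    row_norms = np.linalg.norm(G, axis=1)        # ||g^i||_2 for each row g^i
    return np.sum(row_norms ** p) ** (1.0 / p)

G = np.array([[3.0, 4.0],    # row norm 5
              [0.0, 0.0],    # zero row: contributes nothing for any p
              [1.0, 0.0]])   # row norm 1
print(l2p_matrix_norm(G, 1.0))   # l_{2,1} norm: 5 + 1 = 6
print(l2p_matrix_norm(G, 0.5))   # l_{2,1/2} norm: (sqrt(5) + 1)^2
```

Note how raising row norms to a power p < 1 weights small-but-nonzero rows relatively more heavily, which is why penalizing this norm pushes whole rows toward zero.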

When p is 1/2, l2,1/2-matrix norm regularization has the best performance and sparsity across all data points [17], so we apply l2,1/2-matrix norm regularization to Eq. (1) for sparse feature selection:

\min_G \; \mathrm{loss}(G) + \lambda \|G\|_{2,1/2}^{1/2} \quad (3)

2.2. Graph Laplacian semi-supervised learning

Semi-supervised learning can exploit both labeled and unlabeled data, so it is very suitable for web image annotation, where only a few labels are available. On the one hand, semi-supervised learning saves the human labor cost of labeling a large number of training data. On the other hand, it exploits unlabeled data, which are easy to acquire, to improve annotation performance. Because of its simplicity, KNN graph Laplacian based semi-supervised learning has attracted much research interest, and many related methods have been proposed for web image annotation [19,20].

Denote X = [x1, x2, ⋯, xm, xm+1, ⋯, xn]^T as the feature matrix of the training images, where m is the number of labeled images and n is the total number of training images. xi ∈ R^d (1 ≤ i ≤ n) is the i-th datum. Let Y = [y1, y2, ⋯, ym, ym+1, ⋯, yn]^T ∈ {0,1}^{n×c} be the label


Table A.1
Performance comparison (MicroAUC ± standard deviation and MacroAUC ± standard deviation). 5% labeled training data is used and the best results are shown in boldface.

Dataset    Metrics    FSLG           SFSS           FSNM           SMBLR          FSSI
NUS-WIDE   MicroAUC   0.873 ± 0.004  0.822 ± 0.003  0.791 ± 0.006  0.644 ± 0.003  0.167 ± 0.003
           MacroAUC   0.630 ± 0.003  0.608 ± 0.005  0.594 ± 0.003  0.566 ± 0.002  0.429 ± 0.005
MSRA-MM    MicroAUC   0.877 ± 0.002  0.837 ± 0.004  0.816 ± 0.006  0.620 ± 0.002  0.145 ± 0.002
           MacroAUC   0.564 ± 0.003  0.546 ± 0.005  0.538 ± 0.003  0.527 ± 0.003  0.461 ± 0.003


matrix of the training images, where c is the number of classes and yi ∈ R^c (1 ≤ i ≤ n) is the i-th label vector. Let Yij denote the j-th element of yi; then Yij = 1 if xi is in the j-th class, and Yij = 0 otherwise. If xi is not labeled, i.e., i > m, yi is set to a vector of all zeros.
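The label-matrix convention above can be sketched as follows; this is an illustrative NumPy helper (the function name and toy labels are ours, not the paper's), restricted to the single-class case described in the text:

```python
import numpy as np

def build_label_matrix(labels, n, c):
    """Build Y in {0,1}^{n x c}: labels[i] is the class index of image i,
    or None for an unlabeled image, which gets an all-zero row."""
    Y = np.zeros((n, c))
    for i, lab in enumerate(labels):
        if lab is not None:
            Y[i, lab] = 1.0
    return Y

# hypothetical toy data: 3 labeled images, 2 unlabeled, 3 classes
Y = build_label_matrix([0, 2, 1, None, None], n=5, c=3)
```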

Define a graph model S, whose element Sij reflects the similaritybetween xi and xj on the graph as:

S_{ij} = \begin{cases} \exp\!\big( -\|x_i - x_j\|^2 / \sigma^2 \big) & x_i \text{ and } x_j \text{ are } k \text{ nearest neighbors} \\ 0 & \text{otherwise} \end{cases} \quad (4)

S can be redefined as follows to reduce the number of parameters:

S_{ij} = \begin{cases} 1 & x_i \text{ and } x_j \text{ are } k \text{ nearest neighbors} \\ 0 & \text{otherwise} \end{cases} \quad (5)

Here S in Eq. (5) is a special case of S in Eq. (4), obtained as σ approaches ∞.
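The graph construction of Eqs. (4)–(5), together with the graph Laplacian used below, can be sketched in NumPy. This brute-force O(n²) version is illustrative only (the paper targets large-scale data), and the symmetrization step is our assumption, since the raw kNN relation is not symmetric:

```python
import numpy as np

def knn_graph_laplacian(X, k, sigma=None):
    """kNN similarity graph of Eqs. (4)/(5) and its Laplacian L = D - S.
    X: one sample per row. sigma=None -> binary weights (Eq. (5));
    otherwise Gaussian weights exp(-||xi - xj||^2 / sigma^2) (Eq. (4))."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]    # k nearest neighbors, self excluded
        S[i, nbrs] = 1.0 if sigma is None else np.exp(-d2[i, nbrs] / sigma ** 2)
    S = np.maximum(S, S.T)                   # symmetrize the kNN relation
    L = np.diag(S.sum(axis=1)) - S           # graph Laplacian L = D - S
    return S, L

X = np.array([[0.0], [0.1], [5.0], [5.1]])   # two tight clusters
S, L = knn_graph_laplacian(X, k=1)
# every row of L sums to zero, a defining property of a graph Laplacian
```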

Let D be the diagonal matrix with D_{ii} = \sum_{j=1}^{n} S_{ij}; then L = D − S is the graph Laplacian matrix.

In [27], Zhou et al. have proposed a transductive classification

algorithm by label propagation. Following the recent transductive classification algorithms [22,28], in order to exploit both labeled and unlabeled data, a predicted label matrix F = [f1, f2, ⋯, fn]^T ∈ R^{n×c} is defined for all the training data. fi ∈ R^c (1 ≤ i ≤ n) is the predicted label of xi. According to [29], F should simultaneously satisfy smoothness on the ground truth labels of the training data and on the graph model S. Hence, F can be obtained by minimizing the following objective function:

\arg\min_{F} \; \sum_{l=1}^{c} \left[ \frac{1}{2} \sum_{i,j=1}^{n} \big( F_{il} - F_{jl} \big)^2 S_{ij} + \sum_{i=1}^{n} U_{ii} \big( F_{il} - Y_{il} \big)^2 \right] \quad (6)

where Fil is the l-th element of fi, and U ∈ R^{n×n} is a diagonal matrix named the decision rule matrix, whose diagonal elements are Uii = ∞ if xi is labeled and Uii = 1 otherwise. This decision rule matrix U makes the predicted labels F consistent with the ground truth labels Y.
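In implementations, the infinite diagonal entries of U are typically replaced by a large finite constant; the sketch below (our assumption, not the authors' code) does exactly that, using the paper's convention that the m labeled images come first:

```python
import numpy as np

def decision_rule_matrix(n, m, big=1e8):
    """Diagonal decision-rule matrix U: a very large value (standing in for
    the paper's infinity) for the first m labeled images, 1 for unlabeled."""
    u = np.ones(n)
    u[:m] = big
    return np.diag(u)

U = decision_rule_matrix(n=5, m=2)
```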

Eq. (6) can be rewritten as

\arg\min_{F} \; \mathrm{Tr}\big( F^T L F \big) + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big) \quad (7)

The graph-based transductive algorithms have been applied to many multimedia applications, such as content-based multimedia retrieval [30,31], image annotation [32,6,19], and video annotation [33].

Table A.2
Performance comparison (MicroAUC ± standard deviation and MacroAUC ± standard deviation). 10% labeled training data is used and the best results are shown in boldface.

Dataset    Metrics    FSLG           SFSS           FSNM           SMBLR          FSSI
NUS-WIDE   MicroAUC   0.888 ± 0.001  0.851 ± 0.003  0.839 ± 0.003  0.693 ± 0.006  0.155 ± 0.002
           MacroAUC   0.679 ± 0.005  0.636 ± 0.005  0.625 ± 0.004  0.608 ± 0.003  0.403 ± 0.004
MSRA-MM    MicroAUC   0.885 ± 0.002  0.857 ± 0.006  0.838 ± 0.009  0.636 ± 0.003  0.139 ± 0.001
           MacroAUC   0.587 ± 0.002  0.569 ± 0.003  0.553 ± 0.005  0.538 ± 0.005  0.441 ± 0.002

Different from previous semi-supervised learning algorithms [34–37], here we minimize the prediction error with respect to the label prediction matrix to exploit the unlabeled data [38]. Inspired by [29], we design a robust loss function for Eq. (1) as follows:

\arg\min_{F,G,b} \; \mathrm{Tr}\big( F^T L F \big) + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big) + \mu \big\| X^T G + \mathbf{1}_n b^T - F \big\|_F^2 \quad (8)

where b ∈ R^c is the bias term, 1_n ∈ R^n denotes a column vector with all elements equal to 1, and μ is a regularization parameter. From the above function, the label prediction matrix and the classifier can be learned simultaneously.

2.3. Web image annotation

Usually, an image is represented by different types of features, so image annotation exploits the correspondence between the different features and the semantic concepts of the images. Nonetheless, not all features are useful for image annotation, and irrelevant and/or redundant features should be reduced. Because feature selection can extract important features and reduce noise to better represent the images, much work [3–6] has focused on optimizing the feature selection process in annotation frameworks. Faced with countless and complex web image resources, sparse feature selection becomes more important because it can select the most discriminative and sparse features with good robustness. Moreover, many web image resources are unlabeled, so utilizing semi-supervised learning to exploit a small number of labeled data and a large number of unlabeled data helps improve web image annotation performance.

In this paper, we propose a novel sparse feature selection framework which selects the most discriminative features based on the l2,1/2-matrix norm and graph Laplacian based semi-supervised learning to boost web image annotation performance.

3. The proposed framework

In this section, we propose a sparse feature selection framework for web image annotation, namely sparse Feature Selection based on Graph Laplacian (FSLG). We first introduce the formulation of FSLG and then design an effective algorithm to solve the objective function of FSLG.

3.1. FSLG formulation

The basic idea of our method is to realize sparse feature selection based on the l2,1/2-matrix norm with graph Laplacian based semi-supervised learning.

Denote X = [x1, x2, ⋯, xm, xm+1, ⋯, xn]^T as the feature matrix of the training images, where m is the number of labeled images and the total



Table A.3
Performance comparison (MicroAUC ± standard deviation and MacroAUC ± standard deviation). 25% labeled training data is used and the best results are shown in boldface.

Dataset    Metrics    FSLG           SFSS           FSNM           SMBLR          FSSI
NUS-WIDE   MicroAUC   0.901 ± 0.001  0.874 ± 0.002  0.857 ± 0.001  0.750 ± 0.002  0.134 ± 0.001
           MacroAUC   0.716 ± 0.004  0.692 ± 0.003  0.676 ± 0.002  0.661 ± 0.003  0.355 ± 0.002
MSRA-MM    MicroAUC   0.891 ± 0.001  0.872 ± 0.002  0.856 ± 0.008  0.641 ± 0.004  0.125 ± 0.001
           MacroAUC   0.620 ± 0.004  0.601 ± 0.004  0.588 ± 0.002  0.555 ± 0.002  0.411 ± 0.002


number of training images is n. xi ∈ R^d (1 ≤ i ≤ n) is the i-th datum. Let Y = [y1, y2, ⋯, ym, ym+1, ⋯, yn]^T ∈ {0,1}^{n×c} be the label matrix of the training images, where c is the number of classes and yi ∈ R^c (1 ≤ i ≤ n) is the i-th label vector. Let Yij denote the j-th element of yi; then Yij = 1 if xi is in the j-th class, and Yij = 0 otherwise. If xi is not labeled, yi is set to a vector of all zeros.

By integrating the sparse feature selection based on the l2,1/2-matrix norm given in Eq. (3) with the graph Laplacian based semi-supervised learning given in Eq. (8), we propose the following objective function as our foundation:

\arg\min_{F,G,b} \; \mathrm{Tr}\big( F^T L F \big) + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big) + \mu \big\| X^T G + \mathbf{1}_n b^T - F \big\|_F^2 + \lambda \|G\|_{2,1/2}^{1/2} \quad (9)

In the above function, the regularization term λ‖G‖_{2,1/2}^{1/2} guarantees that our model can select the most sparse and discriminative features with good robustness. The predicted label matrix F introduced by the graph Laplacian based semi-supervised learning can simultaneously satisfy smoothness on the ground truth labels of the training data and on the graph model S. This lets our method exploit the small number of labeled data and the large number of unlabeled data well to select more important features.

From Eq. (9), F, G, and b can be solved simultaneously. Once obtained, they can be utilized directly for image annotation. Given the testing image features Xtest, the label prediction matrix Ftest is

F_{test} = X_{test}^T G + \mathbf{1}_{n_{test}} b^T \quad (10)
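A minimal sketch of the annotation step in Eq. (10), assuming NumPy and the paper's shape conventions (X_test is d × n_test, G is d × c); the function name and toy values are ours:

```python
import numpy as np

def annotate(X_test, G, b):
    """Eq. (10): F_test = X_test^T G + 1 b^T. X_test is d x n_test;
    each row of F_test scores the c concepts for one test image."""
    n_test = X_test.shape[1]
    return X_test.T @ G + np.ones((n_test, 1)) @ b.reshape(1, -1)

d, c, n_test = 4, 3, 2
G = np.zeros((d, c))                      # toy projection matrix
b = np.array([0.1, 0.2, 0.3])             # toy bias term
F_test = annotate(np.zeros((d, n_test)), G, b)
# with a zero projection matrix, every image just receives the bias scores
```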

It is worth mentioning that our framework differs from [29] and [6]. The method in [29] is dedicated to classification and dimensionality reduction, not to feature selection. Though the method in [6] is able to select features across all the data points, its selected features are less sparse and robust than those of our method. Moreover, the way we obtain G differs from [29] and [6]. In [29], G is calculated easily in closed form. We also cannot directly adopt the algorithm in [6] because the l2,1/2-matrix norm makes our framework non-convex. To solve our model efficiently, we propose an efficient algorithm to obtain G in the next section.

3.2. Solution

Our problem involves the l2,1/2-matrix norm, which makes Eq. (9) non-convex. We therefore solve the objective problem of Eq. (9) as follows.

Table A.4
Performance comparison (MicroAUC ± standard deviation and MacroAUC ± standard deviation). 50% labeled training data is used and the best results are shown in boldface.

Dataset    Metrics    FSLG           SFSS           FSNM           SMBLR          FSSI
NUS-WIDE   MicroAUC   0.903 ± 0.001  0.889 ± 0.002  0.878 ± 0.002  0.800 ± 0.004  0.116 ± 0.001
           MacroAUC   0.746 ± 0.004  0.720 ± 0.003  0.704 ± 0.003  0.683 ± 0.003  0.318 ± 0.002
MSRA-MM    MicroAUC   0.895 ± 0.001  0.884 ± 0.001  0.871 ± 0.009  0.665 ± 0.002  0.116 ± 0.001
           MacroAUC   0.652 ± 0.003  0.628 ± 0.005  0.614 ± 0.002  0.575 ± 0.004  0.391 ± 0.003

By setting the derivative of Eq. (9) w.r.t. b to zero, we have

2\mu \mathbf{1}_n^T \big( X^T G + \mathbf{1}_n b^T - F \big) = 0 \;\Rightarrow\; b^T = \frac{1}{n} \big( \mathbf{1}_n^T F - \mathbf{1}_n^T X^T G \big) \quad (11)

Substituting b^T in Eq. (9) with Eq. (11), the objective function becomes

\arg\min_{F,G} \; \mathrm{Tr}\big( F^T L F \big) + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big) + \mu \Big\| X^T G + \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^T F - \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^T X^T G - F \Big\|_F^2 + \lambda \|G\|_{2,1/2}^{1/2}

\Rightarrow \arg\min_{F,G} \; \mathrm{Tr}\big( F^T L F \big) + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big) + \mu \big\| A X^T G - A F \big\|_F^2 + \lambda \|G\|_{2,1/2}^{1/2} \quad (12)

where we define A = I - \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^T, which satisfies A = A^T = A^2.

By setting the derivative of Eq. (12) w.r.t. F to zero, we have

2LF + 2U(F - Y) - 2\mu A \big( A X^T G - A F \big) = 0 \;\Rightarrow\; F = JK, \quad \text{where } J = (L + U + \mu A)^{-1} \text{ and } K = UY + \mu A X^T G \quad (13)

Substituting F in Eq. (12) with Eq. (13), the objective function becomes

\arg\min_{G} \; \mathrm{Tr}\big( K^T J^T (L + U) J K - K^T J^T U Y - Y^T U J K + \mu G^T X A X^T G - \mu G^T X A J K - \mu K^T J^T A X^T G + \mu K^T J^T A J K \big) + \lambda \|G\|_{2,1/2}^{1/2} \quad (14)

Since Tr(K^T J^T U Y) = Tr(Y^T U J K) and Tr(μ G^T X A J K) = Tr(μ K^T J^T A X^T G), Eq. (14) becomes

\arg\min_{G} \; \mathrm{Tr}\big( K^T J^T (L + U + \mu A) J K - 2 K^T J^T U Y + \mu G^T X A X^T G - 2\mu K^T J^T A X^T G \big) + \lambda \|G\|_{2,1/2}^{1/2}

\Rightarrow \arg\min_{G} \; \mathrm{Tr}\big( K^T J^T J^{-1} J K - 2 K^T J^T \big( U Y + \mu A X^T G \big) + \mu G^T X A X^T G \big) + \lambda \|G\|_{2,1/2}^{1/2}

\Rightarrow \arg\min_{G} \; \mathrm{Tr}\big( K^T J^T K - 2 K^T J^T K + \mu G^T X A X^T G \big) + \lambda \|G\|_{2,1/2}^{1/2}

\Rightarrow \arg\min_{G} \; \mathrm{Tr}\big( \mu G^T X A X^T G - K^T J^T K \big) + \lambda \|G\|_{2,1/2}^{1/2} \quad (15)




As K = UY + μAX^TG, substituting it into Eq. (15), we get

\arg\min_{G} \; \mathrm{Tr}\big( \mu G^T X A X^T G - \big( U Y + \mu A X^T G \big)^T J^T \big( U Y + \mu A X^T G \big) \big) + \lambda \|G\|_{2,1/2}^{1/2}

\Rightarrow \arg\min_{G} \; \mathrm{Tr}\big( G^T \big( X A (\mu I - \mu^2 J) A X^T \big) G - 2\mu Y^T U J A X^T G \big) + \lambda \|G\|_{2,1/2}^{1/2} \quad (16)

Define M = XA(μI − μ2J)AXT and N = μXAJUY, then the objectivefunction becomes the following quadratic problem

\arg\min_{G} \; \mathrm{Tr}\big( G^T M G \big) - 2\,\mathrm{Tr}\big( N^T G \big) + \lambda \|G\|_{2,1/2}^{1/2} \quad (17)

Fig. C. The annotation performance variation according to the number of selected features for FSLG and SFSS on the two datasets. Fig. C.1 NUS-WIDE dataset. Fig. C.2 MSRA-MM dataset.

Given G = [g^1, ⋯, g^d]^T, we have \|G\|_{2,1/2}^{1/2} = 4\,\mathrm{Tr}\big( G^T D G \big), where D is a diagonal matrix whose diagonal elements are D_{ii} = \frac{1}{4 \|g^i\|_2^{3/2}}. Then the objective in Eq. (17) is equivalent to

\arg\min_{G} \; \mathrm{Tr}\big( G^T M G \big) - 2\,\mathrm{Tr}\big( N^T G \big) + 4\lambda\,\mathrm{Tr}\big( G^T D G \big) \quad (18)

By setting the derivative of Eq. (18) w.r.t. G to zero, we have:

2MG - 2N + 8\lambda D G = 0 \;\Rightarrow\; G = (M + 4\lambda D)^{-1} N \quad (19)
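The norm identity used here, ‖G‖_{2,1/2}^{1/2} = 4 Tr(G^T D G) with D_ii = 1/(4‖g^i‖_2^{3/2}), is easy to verify numerically; a small NumPy check (illustrative only, with a random test matrix of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 3))              # random matrix with nonzero rows

row_norms = np.linalg.norm(G, axis=1)
lhs = np.sum(row_norms ** 0.5)               # ||G||_{2,1/2}^{1/2} by definition
D = np.diag(1.0 / (4.0 * row_norms ** 1.5))  # D_ii = 1/(4 ||g^i||_2^{3/2})
rhs = 4.0 * np.trace(G.T @ D @ G)            # 4 Tr(G^T D G)
assert np.isclose(lhs, rhs)
```

Note the identity holds for the current G; in the algorithm, D is recomputed from the previous iterate, which is what makes the scheme an iteratively reweighted quadratic minimization.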

An iterative approach is proposed in Algorithm 1 to solve theobjective problem in Eq. (18).

Algorithm 1. The FSLG algorithm.

Input: The training image features X ∈ R^{d×n}; the training image labels Y ∈ R^{n×c}; regularization parameters μ and λ.
Output: Optimized projection matrix G ∈ R^{d×c}.

1: Compute the graph Laplacian matrix L ∈ R^{n×n};
2: Compute the decision rule matrix U ∈ R^{n×n};
3: A = I − (1/n)·1_n 1_n^T;
4: J = (L + U + μA)^{−1};
5: M = XA(μI − μ²J)AX^T;
6: N = μXAJUY;
7: Set t = 0 and initialize G_0 ∈ R^{d×c} randomly;
8: repeat
     Compute the diagonal matrix D_t with (D_t)_{ii} = 1/(4‖g^i_t‖_2^{3/2});
     Update G_{t+1} = (M + 4λD_t)^{−1}N;
     t = t + 1;
   until convergence;
9: Return G.
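A compact NumPy sketch of Algorithm 1 follows. It is illustrative, not the authors' implementation: the infinite entries of U are approximated by a large constant, a small `eps` guards against zero rows of G when forming D_t, and the toy data are random:

```python
import numpy as np

def fslg(X, Y, L, U, mu=1.0, lam=0.1, n_iter=30, eps=1e-8):
    """Sketch of Algorithm 1 (FSLG). X is d x n, Y is n x c; L and U are
    the n x n graph Laplacian and decision-rule matrices."""
    d, n = X.shape
    A = np.eye(n) - np.ones((n, n)) / n                   # A = I - (1/n) 1 1^T
    J = np.linalg.inv(L + U + mu * A)                     # step 4
    M = X @ A @ (mu * np.eye(n) - mu**2 * J) @ A @ X.T    # step 5
    N = mu * X @ A @ J @ U @ Y                            # step 6
    G = np.random.default_rng(0).standard_normal((d, Y.shape[1]))
    for _ in range(n_iter):
        row_norms = np.maximum(np.linalg.norm(G, axis=1), eps)  # eps guards zero rows
        D = np.diag(1.0 / (4.0 * row_norms ** 1.5))       # (D_t)_ii = 1/(4 ||g^i||^{3/2})
        G = np.linalg.solve(M + 4.0 * lam * D, N)         # Eq. (19) update
    return G

# toy problem (hypothetical): 8 training images, 5 features, 2 concepts, first 4 labeled
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))
Y = np.zeros((8, 2)); Y[[0, 2], 0] = 1; Y[[1, 3], 1] = 1
S = np.ones((8, 8)) - np.eye(8)                           # trivial fully-connected graph
L = np.diag(S.sum(axis=1)) - S
U = np.diag([1e6] * 4 + [1.0] * 4)                        # large constant stands in for infinity
G = fslg(X, Y, L, U)
feature_scores = np.linalg.norm(G, axis=1)                # rank features by row norm of G
```

Features are then ranked by the l2-norms of the rows of G; rows driven toward zero correspond to features the regularizer discards.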

Here we briefly discuss the computational complexity of the proposed algorithm. At the training stage, we need to compute the graph Laplacian matrix L and the decision rule matrix U, whose computational complexity is O(n²). Learning the optimized projection matrix G involves calculating the inverses of a few matrices, the largest complexity being O(n³). Thus, the complexity of the training process is about O(n³). At the test stage, once G is obtained, we need to perform c × d × ntest multiplications to predict the concepts of the testing images, where ntest is the number of test images. For large-scale datasets, ntest ≫ c and ntest ≫ d. Therefore, the image annotation complexity of our framework is approximately linear with respect to ntest, making it suitable for annotating large-scale web images.

The proposed iterative approach in Algorithm 1 can be verified to converge to the optimal G by Theorem 1. Before that, we introduce two lemmas.

Lemma 1. If φ(t) = 4t − t⁴ − 3, then for any t > 0, φ(t) ≤ 0.

Proof. See Appendix A.

Lemma 2. Denote G_t as the optimal result of the t-th iteration and G_{t+1} as the variable of the (t+1)-th iteration of Algorithm 1; at the same time,


Fig. D. The number of selected features of each type is different in terms of the annotated labels. The first line of images with the labels and the selected features is from the NUS-WIDE dataset, and the second line of images with the labels and the selected features is from the MSRA-MM dataset.

C. Shi et al. / Image and Vision Computing 32 (2014) 189–201

suppose that g_t^i and g_{t+1}^i are the i-th rows of G_t and G_{t+1} respectively; then the following inequality holds:

$$\|g_{t+1}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t+1}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}} \le \|g_{t}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}, \qquad i = 1, \cdots, d$$

Then we have the following conclusion:

$$\sum_{i=1}^{d}\left(\|g_{t+1}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t+1}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}\right) \le \sum_{i=1}^{d}\left(\|g_{t}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}\right)$$

Proof. See Appendix B.

Theorem 1. The objective function value shown in Eq. (17) monotonically decreases in each iteration until convergence using the iterative approach in Algorithm 1.

Proof. See Appendix C.

The lemmas and the convergence of Algorithm 1 can be proved following the work in [5,15] and [17].

3.3. Nonlinear FSLG model

In our method, the FSLG model satisfies the linear assumption of sparse feature selection. In order to deal with non-linear sparse feature selection, we extend the FSLG model to a nonlinear algorithm via the kernel trick.

To this end, we project the original data into a high-dimensional kernel Hilbert space ℋ with a kernel mapping function. Here, ϕ(X) and ϕ(Y) are the mapped representations of the training features and the labels in the kernel space ℋ. Then, the nonlinear FSLG model for

web image annotation can be written as follows:

$$\arg\min_{F,G,b}\ \mathrm{Tr}\left(F^{T}LF\right) + \mathrm{Tr}\left(\left(F-\phi(Y)\right)^{T}U\left(F-\phi(Y)\right)\right) + \mu\left\|\phi(X)^{T}G + \mathbf{1}_{n}b^{T} - F\right\|_{F}^{2} + \lambda\,\|G\|_{2,1/2}^{1/2} \qquad (20)$$

Note that FSLG is actually a special case of the kernel FSLG if we choose the linear kernel ϕ: x → x.

4. Experiments

In this section, we conduct extensive experiments on two web image datasets to validate the performance of the proposed algorithm for large-scale web image annotation.

4.1. Image datasets

In our experiments, we have used two web image datasets, i.e., the NUS-WIDE dataset [25] and the MSRA-MM2.0 dataset [26], to test the performance of our method.

The NUS-WIDE dataset was created by the Lab for Media Search at the National University of Singapore in 2009 and includes 269,648 real-world images with 81 concepts. Three types of visual features, including the 144-dimensional normalized color correlogram (CORR), the 128-dimensional normalized wavelet texture (WT), and the 73-dimensional normalized edge direction histogram (EDH), are combined here to form a 345-dimensional feature vector representing each image in this dataset.

The MSRA-MM2.0 dataset was created by Microsoft Research Asia in 2009 and consists of 50,000 images with 100 concepts. Three types of visual features, including the 144-dimensional normalized color correlogram (CORR), the 128-dimensional normalized wavelet texture (WT), and the 75-dimensional normalized edge direction histogram (EDH), are combined to form a 347-dimensional feature vector representing each image in this dataset.


Table B. Performance comparison using different parameter p (MAP).

Datasets   | p = 0.1 | p = 0.25 | p = 0.5 | p = 0.75 | p = 1
NUS-WIDE   | 0.062   | 0.067    | 0.083   | 0.080    | 0.077
MSRA-MM    | 0.047   | 0.050    | 0.053   | 0.051    | 0.046


4.2. Compared methods

The FSLG model is compared with several related sparse feature selection algorithms for web image annotation. The compared methods are listed as follows.

• Sparse multinomial logistic regression via Bayesian l1 regularization (SBMLR) [39]: It realizes sparse feature selection by using a Laplace prior based on Bayesian l1 regularization.

Fig. E. The influence of unlabeled data on the annotation results of methods FSLG and SFSS. The dark blue bar corresponds to the only-labeled-data case and the light blue bar corresponds to the all-training-data case of method SFSS. The yellow bar corresponds to the only-labeled-data case and the brown bar corresponds to the all-training-data case of method FSLG. Fig. E.1 NUS-WIDE dataset. Fig. E.2 MSRA-MM dataset.

• Feature selection via joint l2,1-norms minimization (FSNM) [15]: It employs joint l2,1-norm minimization on both the loss function and the regularization for joint feature selection.

• Structural Feature Selection with Sparsity (SFSS) [6]: a recent sparse feature selection algorithm. It utilizes semi-supervised learning with the l2,1-norm for feature selection.

• Feature Selection with Shared Information among multiple tasks (FSSI) [18]: a recent multi-task feature selection algorithm, which can borrow knowledge from other related tasks for feature selection.

In our experiments, our method FSLG and methods SFSS and FSSI can simultaneously realize feature selection and classification. For the FSNM and SBMLR methods, we first use them to perform feature selection, and then use regularized least squares regression to learn classifiers from the selected sparse features.

4.3. Experiment setup

In the experiment, a training set comprising 3000 images is randomly sampled from each dataset and the rest of the images are used as the testing set. The training set includes n labeled samples, where n is set to 5%, 10%, 25%, 50%, and 100% of the total training set respectively. We observe the performance variation according to the number of labeled training data and record the corresponding results. The experiments are conducted five times independently and the average results are reported.

There are several parameters to tune in our experiments. The number of nearest neighbors k used to compute the Laplacian matrix is chosen from {10, 15, 20, 25}. The two regularization parameters μ and λ in Eq. (9) are searched over the range {0.001, 0.01, 0.1, 1, 10, 100, 1000} and the best results are reported. Parameter p is chosen from {0.1, 0.25, 0.5, 0.75, 1} in the l2,p-matrix norm based optimization problems.

In the experiments, we use three evaluation metrics to evaluate the annotation performance, i.e., Mean Average Precision (MAP), MicroAUC, and MacroAUC. MAP is a well-known evaluation metric widely used for web image annotation [5]. AUC is a better evaluation metric for classification problems [40,41]. [42] indicates that MicroAUC measures the global performance across multiple classes and MacroAUC measures the average performance over all classes. For MicroAUC, the class indicator vectors of the different classes are concatenated into one long vector, and then a single AUC is computed. For MacroAUC, the AUC of each class is computed first, and then the AUC values of all classes are averaged.
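The two AUC variants described above can be sketched in plain numpy as follows. This is a minimal rank-based AUC without careful tie handling, our own illustration rather than the authors' evaluation code:

```python
import numpy as np

def auc(y_true, y_score):
    """Rank-based AUC: probability that a random positive outranks a random negative."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def micro_auc(Y_true, Y_score):
    # concatenate the class indicator vectors into one long vector, then one AUC
    return auc(Y_true.ravel(), Y_score.ravel())

def macro_auc(Y_true, Y_score):
    # AUC of each class first, then average over the classes
    return np.mean([auc(Y_true[:, j], Y_score[:, j]) for j in range(Y_true.shape[1])])
```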

4.4. Performance evaluation

Fig. A illustrates some annotation results for five images in each dataset. The top three or four keywords are selected as the labels of these images respectively. Though there are some wrong annotation results, Fig. A still indicates that the FSLG algorithm is suitable for web image annotation.

We compare our FSLG algorithm with the other algorithms at different percentages of labeled data. Fig. B shows the annotation results with the metric MAP and Tables A.1–A.4 show the annotation results with the metrics MicroAUC and MacroAUC. Five independent experiments are conducted and the average results are reported. The results illustrate that our method FSLG outperforms the other algorithms on both datasets.

We have the following observations from Fig. B and Tables A.1–A.4. First, our method FSLG has the best performance in terms of MAP, MicroAUC, and MacroAUC on the two datasets. These observations indicate that the proposed FSLG can effectively select the most discriminative sparse features to boost the annotation performance and has the ability to annotate large-scale web images with a comparatively small amount of labeled training data. Second, as the number of labeled


Fig. F. MAP variation with μ and λ on the two datasets. The figure shows that different parameters μ and λ lead to different annotation results. Fig. F.1 NUS-WIDE dataset. Fig. F.2 MSRA dataset.



Table C. Performance comparison using different parameter k (MAP).

Datasets   | k = 10 | k = 15 | k = 20 | k = 25
NUS-WIDE   | 0.082  | 0.083  | 0.083  | 0.082
MSRA-MM    | 0.052  | 0.053  | 0.053  | 0.052


training images increases, the performance of our method FSLG gradually improves. For example, on the NUS-WIDE image dataset, the MicroAUC of FSLG is 0.888 with 10% labeled training images and rises to 0.901 when the percentage of labeled training images increases to 25%.

There are two main reasons for the good performance of FSLG for web image annotation. First, it can select the most discriminative sparse features with good robustness based upon the l2,1/2-matrix norm. Second, our method uses graph Laplacian based semi-supervised learning to exploit the small number of labeled data and the large number of unlabeled data by introducing a predicted label matrix F, which can simultaneously satisfy the smoothness on the ground truth labels of the training data and on the graph model S.

Fig. G. Convergence curves of the objective function value. The figures show that the objective function values monotonically decrease until convergence on the two datasets. Fig. G.1 NUS-WIDE dataset. Fig. G.2 MSRA dataset.

4.5. Influence of selected features

As the number of selected features affects both the effectiveness and the computational efficiency of web image annotation, we perform an experiment to study how the number of selected features affects the performance. At the same time, we compare our method FSLG with method SFSS on the two datasets. Here the number of selected features is set to 50, 100, 150, 200, 250, and all respectively. We use 10% labeled training data and the metric MAP. Parameters μ and λ are set to 1 and k is set to 15. The results of this experiment are shown in Fig. C.

Fig. C shows that the performance of our method FSLG and method SFSS varies as the number of selected features changes. From Fig. C we have the following observations. 1) When the number of selected features is too small, MAP is lower than with all features. This could be attributed to the loss of some useful information. 2) MAP peaks at 200 features for FSLG while at 250 features for SFSS on the NUS-WIDE dataset, and peaks at 150 features for FSLG while at 250 features for SFSS on the MSRA-MM dataset. This indicates that FSLG obtains better performance with fewer features compared to SFSS. We conclude that our method can select sparser and more discriminative features and reduce noise to achieve better annotation performance.
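The selection step discussed above can be sketched as ranking features by the ℓ2 norms of the rows of the learned G and keeping the strongest ones; the helper name `select_features` is our own:

```python
import numpy as np

def select_features(G, n_keep=200):
    """Rank features by the l2 norm of their rows in G; near-zero (sparse)
    rows contribute little to any concept and are discarded.
    Returns the indices of the kept features."""
    row_norms = np.linalg.norm(G, axis=1)
    return np.argsort(-row_norms)[:n_keep]
```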

Here we also conduct an experiment to study how many features of each type are selected in terms of the annotated labels (labels with their preferred features). Fig. D shows three images with the corresponding labels and the selected features from each of the two datasets. The total number of selected features per image is 200 for the NUS-WIDE dataset and 150 for the MSRA-MM dataset. From Fig. D we can see that the number of selected features of each type differs across images with different labels. For example, for the image with labels “flower” and “plants” from the NUS-WIDE dataset, the selected features are 128-dimensional WT, 63-dimensional CORR, and 9-dimensional EDH, which indicates that the WT features are preferred. For the image with labels “wedding” and “flower” from the MSRA-MM dataset, the selected features are 6-dimensional WT, 118-dimensional CORR, and 26-dimensional EDH, which indicates that the CORR features are preferred.

4.6. Influence of the unlabeled data

As our method uses semi-supervised learning to exploit both labeled and unlabeled data, an experiment is specially designed to discover the impact of the unlabeled data on the annotation results. In one case, only the labeled training data are used to learn the FSLG model; in the other case, all the training data, labeled and unlabeled, are used to learn the model. We then compare the results of the two cases to determine the influence of the unlabeled data on the annotation results. The labeled training data are set to 5%, 10%, 25%, and 50% respectively. In addition, we compare our method FSLG with SFSS in this experiment. The results are illustrated in Fig. E.

Fig. E illustrates that using all training data, including unlabeled and labeled data, achieves better results than using only the labeled data on both datasets. This implies that using unlabeled data can enhance annotation performance. At the same time, we can see that our method FSLG yields better annotation results than SFSS with only labeled training data as well as with all training data. This again indicates that our method FSLG has a better ability to annotate web images by exploiting a small number of labeled data and a large number of unlabeled data.

4.7. Influence of parameter p

Our method FSLG realizes better sparse feature selection based on the l2,1/2-matrix norm. The l2,1/2-matrix norm is a special case of the l2,p-matrix norm with p equal to 1/2. In order to demonstrate the influence of parameter p on the annotation results, typical values of p in (0, 1] are tested in this experiment. Here we implement the



optimization problem with different p, set to 0.1, 0.25, 0.5, 0.75, and 1 respectively. Table B shows the annotation results for the different values of p on the NUS-WIDE and MSRA-MM2.0 datasets. The results in bold indicate the best performance.

We conduct this experiment with 10% labeled training data and use the metric MAP. The experiment produces different annotation results for different values of p. From Table B we have the following observations. 1) The best annotation results are produced when p is 0.5. 2) When p belongs to (0, 0.5), the closer p is to 0, the worse the performance. This indicates that the features should not be selected too sparsely, otherwise much useful information is lost. 3) When p is near 1, the model is almost SFSS. To sum up, the results of this experiment show that our method FSLG (p = 0.5) achieves the best web image annotation performance with the most discriminative features.

4.8. Parameter sensitivity study

An experiment is conducted here to study the sensitivity of the regularization parameters μ and λ in Eq. (9). They are set to 0.001, 0.01, 0.1, 1, 10, 100, and 1000 respectively. Following the above experiments, 10% labeled training data are used and MAP is taken as the metric.

Fig. F demonstrates how MAP varies with μ and λ on the two datasets. Here we set k to 15 and use 10% labeled training data. From Fig. F we obtain different annotation performance for different combinations of μ and λ. When μ and λ are approximately equal, the annotation performance is better than when they differ greatly. The annotation performance is best when μ = λ = 10 for the NUS-WIDE dataset, and when μ = λ = 1 for the MSRA-MM2.0 dataset.

4.9. Influence of parameter k

An experiment is conducted to test the influence of parameter k, the number of nearest neighbors used to compute the Laplacian matrix. Following the above experiments, 10% labeled training data are used and parameters μ and λ are set to 1.

Table C shows how the performance varies with parameter k, which is set to 10, 15, 20, and 25 in this experiment. We observe from Table C that the annotation performance of FSLG varies only slightly with different k on the two datasets. Therefore, the number of nearest neighbors k used to compute the Laplacian matrix is set to 15 in the other experiments of this paper.
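Building the k-nearest-neighbor graph behind the Laplacian matrix used throughout these experiments can be sketched as follows; this sketch assumes a simple binary 0/1 affinity, whereas the paper may use a weighted one:

```python
import numpy as np

def knn_laplacian(X, k=15):
    """Build a symmetrized k-nearest-neighbour adjacency S from the columns of
    the d x n feature matrix X and return the unnormalized graph Laplacian
    L = diag(S 1) - S."""
    n = X.shape[1]
    # pairwise squared Euclidean distances between image features
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(dist[i])[1:k + 1]    # skip the point itself
        S[i, nn] = 1.0
    S = np.maximum(S, S.T)                   # symmetrize the graph
    return np.diag(S.sum(axis=1)) - S
```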

4.10. Convergence analysis

As proved above, the objective function value in Eq. (17) monotonically decreases until convergence using the iterative approach of Algorithm 1. Here we conduct an experiment on the two datasets to show the convergence; the convergence curves are shown in Fig. G. The experiment is run on a personal computer (2.90 GHz, 4 GB RAM) with Matlab (2010b).

Fig. G shows the convergence curves of our method FSLG on the NUS-WIDE and MSRA-MM datasets. For the NUS-WIDE dataset, it can be observed from Fig. G.1 that the objective function value converges within 6 iterations and the convergence time is 4.60 s. For the MSRA dataset, we can observe from Fig. G.2 that the objective function value converges within 16 iterations and the corresponding convergence time is 18.5 s. This experiment illustrates that our method FSLG has good convergence, which indicates that FSLG can select sparser and more discriminative features and is capable of annotating large-scale web images.

5. Conclusion

In this paper, we propose a new sparse feature selection framework, FSLG, for web image annotation. The model selects sparser features based on the l2,1/2-matrix norm, and meanwhile exploits both labeled and unlabeled data via graph-based semi-supervised learning. By using the l2,1/2-matrix norm, FSLG can select the most sparse and discriminative features with good robustness to enhance large-scale web image annotation performance. Additionally, the graph Laplacian based semi-supervised learning saves the human labor cost of labeling a large number of images and improves the web image annotation performance by exploiting a small number of labeled data and a large number of unlabeled data. Although the objective function of the FSLG model is non-convex, an effective algorithm to obtain the optimal solution is introduced and the convergence of the algorithm is proven. Extensive experiments are performed on two web image datasets. The results demonstrate that our algorithm outperforms state-of-the-art algorithms and is suitable for large-scale web image annotation.

Acknowledgments

This work was supported partly by the National Natural Science Foundation of China (61172128, 61003114), the National Key Basic Research Program of China (2012CB316304), the Fundamental Research Funds for the Central Universities (2013JBM020, 2013JBZ003), the Program for Innovative Research Team in University of Ministry of Education of China (IRT201206), and the Doctoral Foundation of China Ministry of Education (20120009120009).

Appendix A. Proof of Lemma 1

Proof. Taking the derivative of φ(t) with respect to t and setting it to zero gives

$$\varphi'(t) = 4 - 4t^{3} = 0$$

Then we have the unique stationary point t = 1 on (0, +∞), and it is easily verified that t = 1 is the maximum point. Hence φ(t) ≤ φ(1) = 0 for all t > 0.
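A quick numeric check of Lemma 1 (our own illustration, not part of the paper):

```python
import numpy as np

# phi(t) = 4t - t^4 - 3 has its unique stationary point at t = 1
# (phi'(t) = 4 - 4t^3), a maximum with phi(1) = 0, so phi(t) <= 0 for t > 0.
t = np.linspace(1e-3, 10.0, 100_000)
phi = 4.0 * t - t ** 4 - 3.0
assert phi.max() <= 0.0                    # phi never exceeds 0 on (0, 10]
assert np.isclose(4 * 1.0 - 1.0 ** 4 - 3.0, 0.0)   # phi attains 0 at t = 1
```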

Appendix B. Proof of Lemma 2

Proof. For each row i, let $t = \|g_{t+1}^{i}\|_{2}^{1/2} \big/ \|g_{t}^{i}\|_{2}^{1/2}$ in φ(t). According to Lemma 1, $\varphi\left(\|g_{t+1}^{i}\|_{2}^{1/2} / \|g_{t}^{i}\|_{2}^{1/2}\right) \le 0$, that is,

$$4\,\frac{\|g_{t+1}^{i}\|_{2}^{1/2}}{\|g_{t}^{i}\|_{2}^{1/2}} - \frac{\|g_{t+1}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{2}} \le 3$$

Multiplying both sides of the above formula by $\frac{1}{4}\|g_{t}^{i}\|_{2}^{1/2}$, we have

$$\|g_{t+1}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t+1}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}} \le \|g_{t}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}$$

Thus, by summing the above inequalities over all rows (1 ≤ i ≤ d), we obtain the conclusion of Lemma 2:

$$\sum_{i=1}^{d}\left(\|g_{t+1}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t+1}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}\right) \le \sum_{i=1}^{d}\left(\|g_{t}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}\right)$$



Appendix C. Proof of Theorem 1

Proof. According to Algorithm 1, it can be inferred from Eq. (18) that

$$G_{t+1} = \arg\min_{G}\ \mathrm{Tr}\left(G^{T}MG\right) - 2\,\mathrm{Tr}\left(N^{T}G\right) + \lambda\,\mathrm{Tr}\left(G^{T}D_{t}G\right)$$

Therefore, we have

$$\mathrm{Tr}\left(G_{t+1}^{T}MG_{t+1}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t+1}\right) + \lambda\,\mathrm{Tr}\left(G_{t+1}^{T}D_{t}G_{t+1}\right) \le \mathrm{Tr}\left(G_{t}^{T}MG_{t}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t}\right) + \lambda\,\mathrm{Tr}\left(G_{t}^{T}D_{t}G_{t}\right)$$

Since

$$\mathrm{Tr}\left(G_{t}^{T}D_{t}G_{t}\right) = \sum_{i=1}^{d}\frac{\|g_{t}^{i}\|_{2}^{2}}{4\|g_{t}^{i}\|_{2}^{3/2}}, \qquad \mathrm{Tr}\left(G_{t+1}^{T}D_{t}G_{t+1}\right) = \sum_{i=1}^{d}\frac{\|g_{t+1}^{i}\|_{2}^{2}}{4\|g_{t}^{i}\|_{2}^{3/2}}$$

it follows that

$$\mathrm{Tr}\left(G_{t+1}^{T}MG_{t+1}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t+1}\right) + \lambda\sum_{i=1}^{d}\frac{\|g_{t+1}^{i}\|_{2}^{2}}{4\|g_{t}^{i}\|_{2}^{3/2}} \le \mathrm{Tr}\left(G_{t}^{T}MG_{t}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t}\right) + \lambda\sum_{i=1}^{d}\frac{\|g_{t}^{i}\|_{2}^{2}}{4\|g_{t}^{i}\|_{2}^{3/2}}$$

Adding and subtracting $\lambda\sum_{i=1}^{d}\|g_{t+1}^{i}\|_{2}^{1/2}$ on the left side and $\lambda\sum_{i=1}^{d}\|g_{t}^{i}\|_{2}^{1/2}$ on the right side, we obtain

$$\mathrm{Tr}\left(G_{t+1}^{T}MG_{t+1}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t+1}\right) + \lambda\sum_{i=1}^{d}\|g_{t+1}^{i}\|_{2}^{1/2} - \lambda\sum_{i=1}^{d}\left(\|g_{t+1}^{i}\|_{2}^{1/2} - \frac{\|g_{t+1}^{i}\|_{2}^{2}}{4\|g_{t}^{i}\|_{2}^{3/2}}\right) \le \mathrm{Tr}\left(G_{t}^{T}MG_{t}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t}\right) + \lambda\sum_{i=1}^{d}\|g_{t}^{i}\|_{2}^{1/2} - \lambda\sum_{i=1}^{d}\left(\|g_{t}^{i}\|_{2}^{1/2} - \frac{\|g_{t}^{i}\|_{2}^{2}}{4\|g_{t}^{i}\|_{2}^{3/2}}\right)$$

It has been proven in Lemma 2 that

$$\sum_{i=1}^{d}\left(\|g_{t+1}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t+1}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}\right) \le \sum_{i=1}^{d}\left(\|g_{t}^{i}\|_{2}^{1/2} - \frac{1}{4}\,\frac{\|g_{t}^{i}\|_{2}^{2}}{\|g_{t}^{i}\|_{2}^{3/2}}\right)$$

Then

$$\mathrm{Tr}\left(G_{t+1}^{T}MG_{t+1}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t+1}\right) + \lambda\sum_{i=1}^{d}\|g_{t+1}^{i}\|_{2}^{1/2} \le \mathrm{Tr}\left(G_{t}^{T}MG_{t}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t}\right) + \lambda\sum_{i=1}^{d}\|g_{t}^{i}\|_{2}^{1/2}$$

that is,

$$\mathrm{Tr}\left(G_{t+1}^{T}MG_{t+1}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t+1}\right) + \lambda\,\|G_{t+1}\|_{2,1/2}^{1/2} \le \mathrm{Tr}\left(G_{t}^{T}MG_{t}\right) - 2\,\mathrm{Tr}\left(N^{T}G_{t}\right) + \lambda\,\|G_{t}\|_{2,1/2}^{1/2}$$

This indicates that the objective function value of Eq. (17) monotonically decreases until converging to the optimal G with Algorithm 1.

References

[1] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, New York, USA, 2001.
[2] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, Proc. ECML, 1994, pp. 171–182.
[3] F. Wu, Y. Yuan, Y.T. Zhuang, Heterogeneous feature selection by group lasso with logistic regression, Proc. ACM Multimedia, 2010, pp. 983–986.
[4] Y. Yang, Z. Huang, Y. Yang, J.J. Liu, H.T. Shen, J. Luo, Local image tagging via graph regularized joint group sparsity, Pattern Recogn. 46 (2013) 1358–1368.
[5] Z.G. Ma, Y. Yang, F.P. Nie, J. Uijlings, N. Sebe, Exploiting the entire feature space with sparsity for automatic image annotation, Proc. ACM Multimedia, 2011, pp. 283–292.
[6] Z.G. Ma, F.P. Nie, Y. Yang, J. Uijlings, N. Sebe, A. Hauptmann, Discriminating joint feature analysis for multimedia data understanding, IEEE Trans. Multimed. 14 (6) (2012) 1662–1672.
[7] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. 58 (1996) 267–288.
[8] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, Proc. ACM SIGKDD, 2010, pp. 333–342.
[9] S. Foucart, M.J. Lai, Sparsest solutions of underdetermined linear systems via lq-minimization for 0 < q ≤ 1, Appl. Comput. Harmon. Anal. 26 (3) (2009) 395–407.
[10] D. Krishnan, R. Fergus, Fast image deconvolution using hyper-Laplacian priors, Neural Information Processing Systems, MIT Press, Cambridge, MA, 2009.
[11] R. Chartrand, Exact reconstruction of sparse signals via nonconvex minimization, IEEE Signal Process. Lett. 14 (10) (2007) 707–710.
[12] R. Chartrand, Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data, Proc. IEEE Int. Symp. Biomed. Imag., 2009, pp. 262–265.
[13] Z.B. Xu, X.Y. Chang, F.M. Xu, H. Zhang, L1/2 regularization: a thresholding representation theory and a fast solver, IEEE Trans. Neural Netw. Learn. 23 (7) (2012) 1013–1027.
[14] Z.B. Xu, H. Zhang, Y. Wang, X.Y. Chang, Y. Liang, L1/2 regularizer, Sci. China 53 (6) (2010) 1159–1169.
[15] F.P. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint L2,1-norms minimization, Proc. NIPS, 2010, pp. 1813–1821.
[16] Z.G. Ma, Y. Yang, Y. Cai, N. Sebe, A.G. Hauptmann, Knowledge adaptation for ad hoc multimedia event detection with few exemplars, Proc. ACM MM, 2012, pp. 469–478.
[17] L.P. Wang, S.C. Chen, l2,p-Matrix Norm and Its Application in Feature Selection, http://arxiv.org/abs/1303.3987, 2013.
[18] Y. Yang, Z.G. Ma, A.G. Hauptmann, N. Sebe, Feature selection for multimedia analysis by sharing information among multiple tasks, IEEE Trans. Multimed. 15 (3) (2013) 661–669.
[19] Y. Yang, F. Wu, F.P. Nie, H.T. Shen, Y.T. Zhuang, A.G. Hauptmann, Web and personal image annotation by mining label correlation with relaxed visual graph embedding, IEEE Trans. Image Process. 21 (3) (2012) 1339–1351.
[20] J.H. Tang, R.C. Hong, S.C. Yan, T.S. Chua, Image annotation by kNN-sparse graph-based label propagation over noisily-tagged web images, ACM Trans. Intell. Syst. Technol. 1 (1) (2010) 111–126.
[21] W.Y. Lee, L.C. Hsieh, G.L. Wu, W. Hsu, Graph-based semi-supervised learning with multi-modality propagation for large-scale image datasets, J. Vis. Commun. Image Represent. 24 (2013) 295–302.
[22] X. Zhu, Semi-supervised learning literature survey, Technical Report 1530, University of Wisconsin, Madison, 2007.
[23] T. Jebara, J. Wang, S.F. Chang, Graph construction and b-matching for semi-supervised learning, Proc. ICML, 2009, pp. 441–448.
[24] X.J. Wang, L. Zhang, X. Li, W.Y. Ma, Annotating images by mining image search results, IEEE Trans. Pattern Anal. 30 (11) (2008) 1919–1932.
[25] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image dataset from National University of Singapore, Proc. CIVR, 2009, pp. 1–9.
[26] H. Li, M. Wang, X. Hua, MSRA-MM2.0: a large-scale web multimedia dataset, Proc. ICDMW, 2009, pp. 164–169.
[27] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, Proc. NIPS, 2003, pp. 1–8.
[28] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, Proc. ICML, 2003.
[29] F.P. Nie, D. Xu, T. Hung, C. Zhang, Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. 19 (7) (2010) 1921–1932.
[30] H. Tong, J. He, M. Li, W.Y. Ma, H.J. Zhang, C. Zhang, Manifold-ranking-based keyword propagation for image retrieval, EURASIP J. Appl. Signal Process. (2006) 190.
[31] Y. Yang, Y.T. Zhuang, F. Wu, Y.H. Pan, Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval, IEEE Trans. Multimed. 10 (3) (2008) 437–446.
[32] Z. Zha, T. Mei, J. Wang, Z. Wang, X.-S. Hua, Graph-based semi-supervised learning with multiple labels, J. Vis. Commun. Image Represent. 20 (2) (2009) 97–103.
[33] M. Wang, X.-S. Hua, J.H. Tang, R. Hong, Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Trans. Multimed. 11 (3) (2009) 465–476.
[34] R. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, J. Mach. Learn. Res. 6 (2005) 1817–1853.
[35] Y. Amit, M. Fink, N. Srebro, S. Ullman, Uncovering shared structures in multiclass classification, Proc. ICML, 2007, pp. 17–24.
[36] J. Chen, L. Tang, J. Liu, J. Ye, A convex formulation for learning shared structures from multiple tasks, Proc. ICML, 2009, pp. 137–144.



[37] S. Ji, L. Tang, S. Yu, J. Ye, A shared-subspace learning framework for multi-label classification, ACM Trans. Knowl. Discov. Data 2 (4) (2010) 1–29.
[38] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[39] G. Cawley, N. Talbot, M. Girolami, Sparse multinomial logistic regression via Bayesian L1 regularisation, Proc. NIPS, 2006, pp. 209–216.
[40] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861–874.
[41] D. Hand, R. Till, A simple generalization of the area under the ROC curve for multiple class classification problems, Mach. Learn. 45 (2) (2001) 171–186.
[42] D.D. Lewis, Y. Yang, T.G. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004) 361–397.