
Sparse feature selection based on L2,1/2-matrix norm for web image annotation


Caijuan Shi a,b,c, Qiuqi Ruan a,c, Song Guo a,c, Yi Tian a,c

a Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
b College of Information Engineering, Hebei United University, Tangshan 063009, China
c Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China

Article info

Article history:
Received 27 October 2013
Received in revised form 19 March 2014
Accepted 15 September 2014
Communicated by Jinhui Tang

Keywords:
Sparse feature selection
l2,1/2-matrix norm
Web image annotation
Shared subspace learning

Abstract

Web image annotation based on sparse feature selection has received an increasing amount of interest in recent years. However, existing sparse feature selection methods become less effective and efficient as the scale of web image data grows. This raises an urgent need to develop better sparse feature selection methods to improve web image annotation performance. In this paper we propose a novel sparse feature selection framework for web image annotation, namely Sparse Feature Selection based on the L2,1/2-matrix norm (SFSL). SFSL can select sparser and more discriminative features by exploiting the l2,1/2-matrix norm with shared subspace learning, and thereby improve web image annotation performance. We propose an efficient iterative algorithm to optimize the objective function. Extensive experiments are performed on two web image datasets. The experimental results validate that our method outperforms state-of-the-art algorithms and is suitable for large-scale web image annotation.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, photo-sharing websites such as Flickr and Picasa have grown continuously, and the number of web images has increased explosively. Web image annotation has become a critical research issue for image search and indexing [1–3]. As an important technique for image annotation, feature selection plays an important role in improving annotation performance. Confronted with the large number of web images, classical feature selection algorithms such as Fisher Score [4] and ReliefF [5] become less effective and efficient because they evaluate the importance of each feature individually. Due to its efficiency and effectiveness, sparse feature selection has received an increasing amount of interest for web image annotation in recent years [6–9].

During the last decade, several endeavors have been made towards this research topic. The most well-known sparse model is the l1-norm (lasso) [10], which has been applied to select sparse features for image classification [11] and multi-cluster data [12]. Sparse feature selection based on the l1-norm is computationally convenient and efficient, but the selected features are sometimes not sufficiently sparse, resulting in higher computational cost. Many works [13–16] have extended the l1-norm to the lp-norm. Guo et al. [17] have used lp-norm regularization for robust face recognition. In [18,19], Xu et al. have proposed l1/2-norm regularization and have pointed out that l1/2-norm regularization has the best performance among all lp-norm regularizations with p in (0, 1). However, all the above methods select features one by one and neglect the useful information of the correlation between different features, which leads to a decline in annotation performance. In [20], Nie et al. have introduced a joint l2,1-norm minimization on both the loss function and the regularization for feature selection, which realizes sparse feature selection across all data points. In [8,21], Ma et al. have applied the l2,1-norm to their sparse feature selection models for image annotation. In [22,23], Li et al. have applied the l2,1-norm to unsupervised feature selection using nonnegative spectral analysis. Because the lp-norm (0<p<1) has better sparsity than the l1-norm, Wang et al. [24] have recently proposed extending the l2,1-norm to the l2,p-matrix norm (0<p≤1) so as to select joint, sparser features; at the same time, the l2,p-matrix norm has better robustness than the l2,1-norm. In this paper we propose a sparse feature selection method based upon the l2,1/2-matrix norm to obtain sparser, more robust features for web image annotation.

Usually, each web image is associated with several semantic concepts, so web image annotation is actually a multi-label classification problem. This intrinsic characteristic of web images makes web image annotation more complicated.


Ando et al. have assumed that there is a shared subspace between labels [25], and this shared subspace can utilize the label correlation to improve the performance of multi-label classification. In [22], Li et al. have exploited the latent structure shared by different features to predict the cluster indicators. The authors of Refs. [28–30] have also introduced other related works on shared subspace learning. In this paper, we apply shared subspace learning to sparse feature selection to exploit the relational information of features and enhance web image annotation performance.

In this paper, we propose a new sparse feature selection framework for web image annotation and name it Sparse Feature Selection based on the L2,1/2-matrix norm (SFSL). SFSL can select sparser and more discriminative features by exploiting the l2,1/2-matrix norm model with shared subspace learning. We have tested the performance of our algorithm on two real-world web image datasets, NUS-WIDE [26] and MSRA-MM 2.0 [27]. The experimental results demonstrate that our algorithm SFSL outperforms other existing sparse feature selection algorithms for web image annotation.

The main contributions of this paper are summarized as follows:

● Sparse Feature Selection based on the L2,1/2-matrix norm (SFSL) is proposed, which can select sparser and more discriminative features with good robustness based upon the l2,1/2-matrix norm with shared subspace learning;

● We devise a novel effective algorithm for optimizing the objective function of SFSL and prove the convergence of the algorithm;

● We conduct several experiments on two web image datasets and the results demonstrate the effectiveness and efficiency of our method.

This paper is organized as follows. We briefly introduce the related work on sparse feature selection models and shared subspace learning in Section 2. We then describe the proposed SFSL algorithm and its optimization, followed by the convergence analysis, in Section 3. We conduct extensive experiments to evaluate the effectiveness and efficiency of our method for web image annotation in Section 4, followed by the conclusion in Section 5.

2. Related work

In this section, we discuss several sparse feature selection models, especially the l2,p-matrix norm model. Besides, we also briefly review shared subspace learning.

2.1. Sparse feature selection model

Compared with traditional feature selection algorithms, sparse feature selection models can select the most discriminative features and simultaneously reduce the computational cost based on different sparse models. Here we introduce some sparse feature selection models based on the l1-norm, l1/2-norm, l2,1-norm and l2,p-matrix norm respectively.

Denote $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ as the feature matrix of the training images, where $x_i \in \mathbb{R}^d\,(1 \le i \le n)$ is the $i$th datum and $n$ is the number of training images. $Y = [y_1, y_2, \ldots, y_n]^T \in \{0,1\}^{n \times c}$ is the label matrix of the training images, where $c$ is the number of classes and $y_i \in \mathbb{R}^c\,(1 \le i \le n)$ is the $i$th label vector.

Let $G \in \mathbb{R}^{d \times c}$ be the projection matrix. Here we apply a supervised learning algorithm in the sparse feature selection model to learn the projection matrix $G$. A generally principled framework to obtain $G$ is to minimize the following regularized error:

$$\min_G \; \mathrm{loss}(G^T X, Y) + \alpha R(G) \qquad (1)$$

where $\mathrm{loss}(\cdot)$ is the loss function and $\alpha R(G)$ is the regularization term, with $\alpha$ as its regularization parameter.

2.2. l1-norm model

Though the $l_0$-norm theoretically obtains the sparsest solution, the resulting selection problem has been proven to be NP-hard. In practice, the $l_1$-norm is usually used to reformulate sparse feature selection as a convex problem. Using the traditional least squares regression as the loss function and the $l_1$-norm as the regularization in (1), the projection matrix $G \in \mathbb{R}^{d \times c}$ can be obtained as follows:

$$\min_G \; \|X^T G - Y\|_2^2 + \alpha \|G\|_1 \qquad (2)$$
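For concreteness, the following is a minimal proximal-gradient (ISTA) sketch of problem (2) in NumPy. It is not the implementation used in the paper; the step size, iteration count, and the convention that X is the d × n feature matrix and Y the n × c label matrix are illustrative assumptions.

import numpy as np

def l1_feature_selection(X, Y, alpha=0.1, lr=1e-3, n_iter=500):
    """Minimize ||X^T G - Y||_2^2 + alpha * ||G||_1 by proximal gradient (ISTA).

    X: d x n feature matrix, Y: n x c label matrix (conventions of Eq. (2)).
    Returns the projection matrix G (d x c); rows with large l2-norm
    correspond to informative features.
    """
    d, _ = X.shape
    c = Y.shape[1]
    G = np.zeros((d, c))
    for _ in range(n_iter):
        grad = 2.0 * X @ (X.T @ G - Y)                              # gradient of the squared loss
        G = G - lr * grad                                           # gradient step
        G = np.sign(G) * np.maximum(np.abs(G) - lr * alpha, 0.0)    # soft-threshold (prox of l1)
    return G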

2.3. l1/2-norm model

Many works have extended the $l_1$-norm regularization to the $l_p$-norm $(0<p<1)$ regularization because the solution of the $l_p$ $(0<p<1)$ regularization is sparser than that of the $l_1$ regularization. Xu et al. [18,19] have shown that the $l_{1/2}$-norm regularization is the sparsest, with the best performance, among $p$ in (0, 1). The sparse feature selection model with the $l_{1/2}$-norm regularization can then be written as follows:

$$\min_G \; \|X^T G - Y\|_2^2 + \alpha \|G\|_{1/2} \qquad (3)$$

2.4. l2,1-norm model

In [20] Nie et al. have introduced a joint $l_{2,1}$-norm minimization on both the loss function and the regularization for feature selection, which realizes sparse feature selection across all data points. Moreover, the $l_{2,1}$-norm minimization on both the loss function and the regularization overcomes the sensitivity of the squared-norm residual. The optimization problem with the $l_{2,1}$-norm is

$$\min_G \; \|X^T G - Y\|_{2,1} + \alpha \|G\|_{2,1} \qquad (4)$$

2.5. l2,p-norm model

Because the l2,1-norm is constructed on the convex l1-norm framework, its sparsity is limited. Considering the sparsity of the lp-norm, Wang et al. [24] have extended the l2,1-norm to the l2,p-matrix norm model for more effective sparse feature selection with good robustness.

The definition of the $l_{2,p}$-matrix norm is

$$\|G\|_{2,p} = \Big(\sum_{i=1}^{d} \|g^i\|_2^p\Big)^{1/p}, \quad p \in (0, 1] \qquad (5)$$

The sparse feature selection model based on the $l_{2,p}$-matrix norm can be written as

$$\min_G \; \|X^T G - Y\|_{2,p}^p + \alpha \|G\|_{2,p}^p \qquad (6)$$
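As a quick illustration of definition (5), here is a short NumPy helper (a direct restatement of the formula, not code from the paper) that computes $\|G\|_{2,p}^p$ over the rows of G:

import numpy as np

def l2p_norm_to_p(G, p=0.5):
    """Compute ||G||_{2,p}^p = sum_i ||g^i||_2^p, where g^i are the rows of G (Eq. (5))."""
    row_norms = np.linalg.norm(G, axis=1)
    return np.sum(row_norms ** p)

# The l2,p-matrix norm itself is this value raised to the power 1/p;
# smaller p drives more rows of G toward zero, i.e. fewer selected features.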

When p = 1, the l2,p-matrix norm reduces to the l2,1-norm. When p belongs to (0, 1), the l2,p (0<p<1) matrix norm becomes a better sparse model than the l2,1-norm because the lp-norm finds sparser solutions than the l1-norm [24].

For any p in (0, 1), the noise magnitude of a distant outlier in (6) is no larger than that in (4). Thus model (6) is expected to be more robust than model (4) [24].

When p = 1, the l2,p-matrix norm, i.e., the l2,1-norm, is convex. But when p belongs to (0, 1), because the lp-norm is neither convex nor Lipschitz continuous, the l2,p-matrix pseudo norm is not convex or Lipschitz continuous either. Wang et al. have presented a unified algorithm to solve the involved l2,p-matrix norm problem for all p in (0, 1).

2.6. Shared subspace learning

In multi-label learning problems, Ando et al. have assumed that there is a shared subspace between labels [25]. Similarly, in feature selection problems, there is a shared subspace between features. The representation in the original feature space and the representation in the shared subspace are used simultaneously to predict the concepts of an image.

Given a feature vector $x \in \mathbb{R}^d$ and a prediction function $y$, shared subspace learning can be defined as follows:

$$y = v^T x + l^T U^T x \qquad (7)$$

where $v \in \mathbb{R}^d$ and $l \in \mathbb{R}^r$ are the weight vectors and $U \in \mathbb{R}^{d \times r}$ is the shared subspace for features.

Denote $V = [v_1, v_2, \ldots, v_c] \in \mathbb{R}^{d \times c}$, $L = [l_1, l_2, \ldots, l_c] \in \mathbb{R}^{r \times c}$ and $U \in \mathbb{R}^{d \times r}$, where $r$ is the dimension of the shared subspace. By defining $G = V + UL$, where $G \in \mathbb{R}^{d \times c}$, the principled framework of supervised learning with shared subspace learning can be written as

$$\min_{V,L,U} \; \mathrm{loss}(G^T X, Y) + \alpha R(V, L) \quad \mathrm{s.t.}\; U^T U = I \qquad (8)$$

Note that the constraint $U^T U = I$ imposed in (8) makes the problem tractable.
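The prediction rule (7) and the reparameterization G = V + UL can be written compactly as below. This is a small sketch of the idea only; the dimension names (d features, r shared dimensions, c concepts) follow Eqs. (7)–(8) and nothing else is assumed.

import numpy as np

def shared_subspace_scores(x, V, U, L):
    """Concept scores for one image as in Eq. (7): y_k = v_k^T x + l_k^T U^T x.

    V: d x c per-concept weights, U: d x r shared subspace (U^T U = I),
    L: r x c weights in the subspace. Equivalent to G^T x with G = V + U @ L.
    """
    G = V + U @ L
    return G.T @ x  # length-c vector of concept scores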

In this paper, we propose a novel sparse feature selection framework which can select the most discriminative features with good robustness based on the l2,1/2-matrix norm model with shared subspace learning, so as to boost web image annotation performance.

3. Sparse feature selection based on L2,1/2-matrix norm (SFSL)

In this section, we propose a novel sparse feature selection framework for web image annotation, namely Sparse Feature Selection based on the L2,1/2-matrix norm (SFSL). First we introduce the SFSL formulation, and then a novel effective algorithm for optimizing the objective function. Finally we present the convergence analysis of the SFSL algorithm.

3.1. SFSL formulation

In [18,19], Xu et al. have pointed out that lp-norm (0<p<1) regularization has better sparsity than l1-norm regularization, and in particular that l1/2-norm regularization has the best performance among all p in (0, 1). The l2,1-norm has been widely used in sparse feature selection because it realizes sparse feature selection across all data points, but the l2,1-norm is constructed on the convex l1-norm framework, whose sparsity is worse than that of the lp-norm (0<p<1). In [24], Wang et al. have extended the l2,1-norm to a mixed l2,p (0<p≤1) matrix norm for better sparsity and robustness. Typical values of p in (0, 1) were tested in l2,p-matrix norm based objective functions, and their experimental results show that p = 0.5 obviously outperforms p = 1. Taking into account the better sparsity and the robustness of the l2,1/2-matrix norm, we apply the l2,1/2-matrix norm to our sparse feature selection framework SFSL.

Setting $p = 1/2$ in (6), the $l_{2,1/2}$-matrix norm can be defined as

$$\|G\|_{2,1/2} = \Big(\sum_{i=1}^{d} \|g^i\|_2^{1/2}\Big)^{2} \qquad (9)$$

Then the sparse feature selection model based on the $l_{2,1/2}$-matrix norm can be written as

$$\min_G \; \|X^T G - Y\|_{2,1/2}^{1/2} + \alpha \|G\|_{2,1/2}^{1/2} \qquad (10)$$

Note that the l2,1/2-matrix norm is neither convex nor Lipschitz continuous. Moreover, in order to exploit the relational information between different features, we introduce shared subspace learning into our sparse feature selection framework SFSL.

The basic idea of our method SFSL is to combine the l2,1/2-matrix norm model with shared subspace learning to realize sparse feature selection. By integrating the sparse feature selection model based on the l2,1/2-matrix norm as given in (10) and the shared subspace learning as given in (8) into one framework, we propose the following objective function of SFSL:

$$\arg\min_{G,L,U} \; \|X^T G - Y\|_{2,1/2}^{1/2} + \lambda \|G\|_{2,1/2}^{1/2} + \mu \|G - UL\|_F^2 \quad \mathrm{s.t.}\; U^T U = I \qquad (11)$$

In (11), $\lambda \|G\|_{2,1/2}^{1/2}$ and $\mu \|G - UL\|_F^2$ are two regularization terms, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $\lambda > 0$ and $\mu > 0$ are regularization parameters. The regularization term $\|G\|_{2,1/2}^{1/2}$ guarantees that our model selects the most sparse and discriminative features across all data points. The regularization term $\|G - UL\|_F^2$ enables our model to select more discriminative features via shared subspace learning by considering the correlation between different features. The loss function $\|X^T G - Y\|_{2,1/2}^{1/2}$ in objective (11) is more robust to outliers because the $l_{2,1/2}$-matrix norm is more robust than the $l_{2,1}$-norm. To sum up, our method SFSL, based on the $l_{2,1/2}$-matrix norm and shared subspace learning, can select the most sparse and discriminative features with better robustness.
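To make the three terms of (11) concrete, the helper below evaluates the objective for given G, U and L. It is only an illustrative sketch; the variable names and the use of dense NumPy arrays are our assumptions, not details from the paper.

import numpy as np

def sfsl_objective(X, Y, G, U, L, lam, mu):
    """Value of the SFSL objective in Eq. (11).

    X: d x n features, Y: n x c labels, G: d x c, U: d x r, L: r x c.
    ||M||_{2,1/2}^{1/2} is computed as sum_i ||m^i||_2^{1/2} over the rows of M.
    """
    def l2_half(M):
        return np.sum(np.sqrt(np.linalg.norm(M, axis=1)))
    loss = l2_half(X.T @ G - Y)                              # robust l2,1/2 loss
    sparsity = lam * l2_half(G)                              # row-sparsity regularizer
    subspace = mu * np.linalg.norm(G - U @ L, 'fro') ** 2    # shared-subspace regularizer
    return loss + sparsity + subspace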

3.2. Optimization

Our problem in (11) involves the l2,1/2-matrix norm, which makes (11) non-convex. Here we propose an effective iterative approach to solve the problem in (11), as follows.

Given $X^T G - Y = [z^1, \ldots, z^n]^T$ and $G = [g^1, \ldots, g^d]^T$, denote $\tilde{D}$ and $D$ as two diagonal matrices whose diagonal elements are $\tilde{D}_{ii} = 1/(4\|z^i\|_2^{3/2})$ and $D_{ii} = 1/(4\|g^i\|_2^{3/2})$, respectively. According to [24], $\|G\|_{2,p}^p = (2/p)\,\mathrm{Tr}(G^T D G)$; hence $\|X^T G - Y\|_{2,1/2}^{1/2} = 4\,\mathrm{Tr}((X^T G - Y)^T \tilde{D} (X^T G - Y))$ and $\|G\|_{2,1/2}^{1/2} = 4\,\mathrm{Tr}(G^T D G)$. The Frobenius norm $\|G - UL\|_F^2$ is equivalent to $\mathrm{Tr}((G - UL)^T (G - UL))$. So the objective in (11) is equivalent to

$$\arg\min_{G,L,U} \; 4\,\mathrm{Tr}((X^T G - Y)^T \tilde{D} (X^T G - Y)) + 4\lambda\,\mathrm{Tr}(G^T D G) + \mu\,\mathrm{Tr}((G - UL)^T (G - UL)) \quad \mathrm{s.t.}\; U^T U = I \qquad (12)$$

By setting the derivative of (12) w.r.t. $L$ to zero, we get

$$\mu(2U^T U L - 2U^T G) = 0 \;\Rightarrow\; L = U^T G \qquad (13)$$

Substituting $L$ in (12) with (13), the objective function becomes

$$\arg\min_{G,U} \; 4\,\mathrm{Tr}((X^T G - Y)^T \tilde{D} (X^T G - Y)) + \mathrm{Tr}(G^T (4\lambda D + \mu I - \mu U U^T) G) \quad \mathrm{s.t.}\; U^T U = I \qquad (14)$$

By setting the derivative of (14) w.r.t. $G$ to zero, we have

$$G = 4 B^{-1} X \tilde{D} Y \qquad (15)$$

where $B = A - \mu U U^T$ and $A = 4 X \tilde{D} X^T + 4\lambda D + \mu I$.


Using the $G$ obtained with (15), the objective function can be rewritten as

$$\arg\min_{U} \; 4\,\mathrm{Tr}(Y^T \tilde{D} Y) - 4\,\mathrm{Tr}(4 Y^T \tilde{D} X^T B^{-1} X \tilde{D} Y) \quad \mathrm{s.t.}\; U^T U = I \qquad (16)$$

Ultimately, the above objective function arrives at

$$\arg\max_{U} \; \mathrm{Tr}\big((U^T J U)^{-1} U^T K U\big) \quad \mathrm{s.t.}\; U^T U = I \qquad (17)$$

where $J = I - \mu A^{-1}$ and $K = A^{-1} X \tilde{D} Y Y^T \tilde{D} X^T A^{-1}$.

Eq. (17) can be transformed into an eigen-decomposition of $J^{-1} K$. However, since $U$ and $G$ cannot be solved from (17) directly, we propose an iterative approach, summarized in Algorithm 1, to solve the objective function of SFSL.

Algorithm 1. The SFSL algorithm.
Input: the training image feature matrix $X \in \mathbb{R}^{d \times n}$ and the training image label matrix $Y \in \mathbb{R}^{n \times c}$.
Output: the optimized projection matrix $G$.
1: Set $t = 0$ and initialize $G_0 \in \mathbb{R}^{d \times c}$ randomly.
2: repeat
   Compute $[z^1, \ldots, z^n]^T = X^T G_t - Y$;
   Compute the diagonal matrices $\tilde{D}_t$ and $D_t$;
   Compute $A_t = 4 X \tilde{D}_t X^T + 4\lambda D_t + \mu I$;
   Compute $J_t = I - \mu A_t^{-1}$;
   Compute $K_t = A_t^{-1} X \tilde{D}_t Y Y^T \tilde{D}_t X^T A_t^{-1}$;
   Obtain $U_t$ by the eigen-decomposition of $J_t^{-1} K_t$;
   Compute $B_t = A_t - \mu U_t U_t^T$;
   Update $G_{t+1} = 4 B_t^{-1} X \tilde{D}_t Y$;
   $t = t + 1$;
   until convergence.
3: Return $G$.
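A compact NumPy sketch of Algorithm 1 follows. It mirrors the update equations (12)–(17) under our reading of the paper; the small eps guard against zero rows, the choice of the r leading eigenvectors of J^{-1}K for U, and the default hyper-parameters are assumptions rather than details stated by the authors.

import numpy as np

def sfsl(X, Y, lam=0.01, mu=10.0, r=None, n_iter=20, eps=1e-8):
    """Iterative SFSL solver (sketch of Algorithm 1).

    X: d x n training feature matrix, Y: n x c training label matrix.
    Returns the projection matrix G (d x c); feature i can then be ranked by ||g^i||_2.
    """
    d, _ = X.shape
    c = Y.shape[1]
    r = min(d, c) if r is None else r
    rng = np.random.default_rng(0)
    G = rng.standard_normal((d, c))
    I_d = np.eye(d)
    for _ in range(n_iter):
        Z = X.T @ G - Y
        Dz = np.diag(1.0 / (4.0 * np.linalg.norm(Z, axis=1) ** 1.5 + eps))  # \tilde{D}_t
        Dg = np.diag(1.0 / (4.0 * np.linalg.norm(G, axis=1) ** 1.5 + eps))  # D_t
        A = 4.0 * X @ Dz @ X.T + 4.0 * lam * Dg + mu * I_d
        A_inv = np.linalg.inv(A)
        J = I_d - mu * A_inv
        K = A_inv @ X @ Dz @ Y @ Y.T @ Dz @ X.T @ A_inv
        # U_t: r leading eigenvectors of J^{-1} K (our reading of the eigen-decomposition step)
        eigvals, eigvecs = np.linalg.eig(np.linalg.inv(J) @ K)
        order = np.argsort(-eigvals.real)
        U = eigvecs[:, order[:r]].real
        B = A - mu * U @ U.T
        G = 4.0 * np.linalg.solve(B, X @ Dz @ Y)                            # Eq. (15)
    return G

In practice the loop can be stopped once the objective of (11) (see the sfsl_objective helper above) stops decreasing, which is exactly what the convergence analysis below guarantees.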

3.3. Convergence analysis

This section presents the convergence analysis of the proposed iterative approach in Algorithm 1. The iterative approach can be verified to converge to the optimal G, as stated in Theorem 1, whose detailed proof is given in the Appendix. Before that, we introduce two lemmas which are essential to Theorem 1.

Lemma 1. If $\varphi(t) = 4t - t^4 - 3$, then for any $t > 0$, $\varphi(t) \le 0$.

Proof. Taking the derivative of $\varphi(t)$ with respect to $t$ and setting it to zero, $\varphi'(t) = 4 - 4t^3 = 0$, we obtain the unique stationary point $t = 1$ on $(0, +\infty)$. It is easy to verify that $t = 1$ is the maximum point. Hence $\varphi(t) \le \varphi(1) = 0$ for $t > 0$. □

Lemma 2. Denote $G_t$ as the optimal result of the $t$th iteration and $G_{t+1}$ as the result of the $(t+1)$th iteration of Algorithm 1, and let $g_t^i$ and $g_{t+1}^i$ be the $i$th rows of $G_t$ and $G_{t+1}$ respectively. Then the following inequality holds:

$$\|g_{t+1}^i\|_2^{1/2} - \frac{\|g_{t+1}^i\|_2^2}{4\|g_t^i\|_2^{3/2}} \le \|g_t^i\|_2^{1/2} - \frac{\|g_t^i\|_2^2}{4\|g_t^i\|_2^{3/2}}, \quad i = 1, \ldots, m.$$

Consequently,

$$\sum_{i=1}^{m}\Big(\|g_{t+1}^i\|_2^{1/2} - \frac{\|g_{t+1}^i\|_2^2}{4\|g_t^i\|_2^{3/2}}\Big) \le \sum_{i=1}^{m}\Big(\|g_t^i\|_2^{1/2} - \frac{\|g_t^i\|_2^2}{4\|g_t^i\|_2^{3/2}}\Big).$$

Proof. For each row $i$, let $s = \|g_{t+1}^i\|_2^{1/2} / \|g_t^i\|_2^{1/2}$. According to Lemma 1, $\varphi(s) \le 0$, that is,

$$4\,\frac{\|g_{t+1}^i\|_2^{1/2}}{\|g_t^i\|_2^{1/2}} - \frac{\|g_{t+1}^i\|_2^{2}}{\|g_t^i\|_2^{2}} \le 3.$$

Multiplying both sides of the above formula by $\frac{1}{4}\|g_t^i\|_2^{1/2}$, we have

$$\|g_{t+1}^i\|_2^{1/2} - \frac{\|g_{t+1}^i\|_2^2}{4\|g_t^i\|_2^{3/2}} \le \frac{3}{4}\|g_t^i\|_2^{1/2} = \|g_t^i\|_2^{1/2} - \frac{\|g_t^i\|_2^2}{4\|g_t^i\|_2^{3/2}}.$$

Summing the above inequalities over all rows $(1 \le i \le m)$ yields the conclusion of Lemma 2. □

Theorem 1. The objective function value in Eq. (11) monotonically decreases in each iteration until convergence using the iterative approach in Algorithm 1.

Proof. See the Appendix. □

4. Experiments

In this section, we validate the efficacy of the proposed algorithm for web image annotation.

We conduct several experiments on two large-scale web image datasets. The first one is the NUS-WIDE dataset created by the Lab for Media Search at the National University of Singapore in 2009 [26]. This dataset includes 269,648 real-world images belonging to 81 concepts. The second one is the MSRA-MM 2.0 dataset created by Microsoft Research Asia in 2009 [27]. This dataset consists of 50,000 images belonging to 100 concepts. In our experiments, we use three types of visual features and concatenate them into a long vector. In the NUS-WIDE dataset, a 144-dimension normalized color correlogram, a 128-dimension normalized wavelet texture and a 73-dimension normalized edge direction histogram are concatenated into a 345-dimension vector to represent each image. In the same way, in the MSRA-MM 2.0 dataset, a 144-dimension normalized color correlogram, a 128-dimension normalized wavelet texture and a 75-dimension normalized edge direction histogram are concatenated into a 347-dimension vector to represent each image.

In our experiments, we randomly sample a training set comprising n × c images for each dataset, where c is the number of concepts of the dataset and n is set to 10 or 20 respectively. Thus we can observe how the performance varies as the amount of training data changes and report the corresponding results. The experiments are independently repeated ten times and we report the average results. The two regularization parameters λ and μ in objective function (11) are tuned over {0.001, 0.01, 0.1, 1, 10, 100, 1000} and the best results are reported. The number of selected features k is set to 50, 100, 150, 200, 250 and 300 respectively. Typical values of the parameter p, namely 0.1, 0.25, 0.5, 0.75 and 1, are tested in the l2,p-matrix norm based optimization problems. To evaluate the annotation performance, we use the well-known evaluation metric Mean Average Precision (MAP), which is widely used for multi-label classification tasks and image annotation.
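The paper does not spell out the exact MAP computation; a common convention for multi-label annotation, and a reasonable stand-in, is the mean of the per-concept average precision, e.g. using scikit-learn:

import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(Y_true, Y_score):
    """Mean of the per-concept average precision (one common MAP convention).

    Y_true: n x c binary ground-truth matrix, Y_score: n x c predicted scores,
    e.g. Y_score = X_test.T @ G for a learned projection matrix G.
    Concepts with no positive test image are skipped.
    """
    aps = [average_precision_score(Y_true[:, k], Y_score[:, k])
           for k in range(Y_true.shape[1]) if Y_true[:, k].sum() > 0]
    return float(np.mean(aps))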

We compare our method SFSL with several state-of-the-art sparse feature selection algorithms for web image annotation. The compared methods include sparse multinomial logistic regression via Bayesian l1 regularization (SBMLR) [31], feature selection via joint l2,1-norms minimization (FSNM) [20], sub-feature uncovering with sparsity (SFUS) [21], and feature selection via the l2,p-matrix norm (0<p≤1) (FSL2p) [24].


4.1. Performance of SFSL

Fig. 1 shows some image annotation results obtained by our method SFSL on the NUS-WIDE dataset and the MSRA-MM 2.0 dataset. We randomly select five images from each dataset and annotate each image with the top three or four concepts.

In Fig. 1, the five images in the top row come from the NUS-WIDE dataset and the annotation results are correct except for one concept of the fourth image (red font). The five images in the second row come from the MSRA-MM 2.0 dataset and the annotation results are totally correct. These annotation results indicate that the SFSL algorithm can annotate web images well.

Table 1 shows the numerical evaluation of the SFSL algorithm with 10 × c and 20 × c training data respectively. Ten independent experiments are conducted and the average results are reported. The number of selected features k is set to 150. Here the regularization parameter λ is set to 0.01 and μ to 10 on the NUS-WIDE dataset, while λ is set to 0.01 and μ to 0.001 on the MSRA-MM 2.0 dataset. The results in bold indicate the best performance.

Fig. 1. Annotation results of some images from two datasets (The wrong concept is indicated by red font.).

Table 1. Performance comparison (MAP ± standard deviation).

Dataset  Training data  SFSL         SFUS         FSNM         FSL2p        SMBLR
NUS      10 × c         0.099±0.001  0.094±0.003  0.092±0.001  0.093±0.001  0.072±0.008
NUS      20 × c         0.112±0.001  0.108±0.002  0.105±0.003  0.107±0.002  0.073±0.007
MSRA     10 × c         0.065±0.001  0.063±0.001  0.061±0.002  0.063±0.002  0.056±0.002
MSRA     20 × c         0.072±0.001  0.070±0.001  0.068±0.001  0.070±0.001  0.059±0.001

Fig. 2. The annotation performance variation with respect to the number of selected features on the two datasets. (a) NUS-WIDE dataset and (b) MSRA-MM dataset.


We have the following observations from Table 1. First, our method SFSL has the best performance in terms of MAP with both 10 × c and 20 × c training data on the two datasets. This observation indicates that the proposed SFSL can effectively select the most discriminative sparse features with good robustness for web image annotation. Second, as the number of training images increases, the performance of all the algorithms gradually improves. For example, on the NUS-WIDE dataset, the MAP of our method SFSL is 0.099 with 810 (10 × c) training images and 0.112 when the number of training images increases to 1620 (20 × c). Our method SFSL achieves good annotation performance with a small number of training images compared to the large number of testing images, which indicates that SFSL has the ability to annotate large-scale web images with a comparatively small amount of training images.

Fig. 3. Convergence curves of the objective function value. The figures show that the objective function values of the methods SFSL, SFUS, FSNM and FSL2p on the two datasets monotonically decrease until convergence. (a) SFSL-NUS, (b) SFUS-NUS, (c) FSNM-NUS, (d) FSL2p-NUS, (e) SFSL-MSRA, (f) SFUS-MSRA, (g) FSNM-MSRA and (h) FSL2p-MSRA.

There are two main reasons for the good performance of SFSL for web image annotation. First, SFSL can select the most discriminative sparse features with good robustness by using the sparse l2,1/2-matrix norm model. Second, shared subspace learning is exploited to consider the correlation between different features and improve the annotation performance. The sparse l2,1/2-matrix norm model and shared subspace learning enable our method SFSL to select the most discriminative features and thus enhance the performance of web image annotation.

4.2. Influence of selected features

The number of selected features can affect the effectiveness and computational efficiency of web image annotation, so we perform an experiment to study the performance variation when the number of selected features is set to 50, 100, 150, 200, 250 and 300 respectively. At the same time, we compare our method SFSL with the other methods SFUS, FSNM and FSL2p. Here 20 × c training data are used on both datasets. Following the above experiment, the regularization parameters λ and μ are set to 0.01 and 10 respectively on the NUS-WIDE dataset, and to 0.01 and 0.001 on the MSRA-MM 2.0 dataset. The results of this experiment are shown in Fig. 2.

Fig. 2 shows how the performance of our method SFSL and the other methods varies with the number of selected features on both datasets. From Fig. 2 we have the following observations. (1) MAP increases as the number of selected features increases until it reaches its peak, and decreases as the number of selected features increases further. This indicates that some useful information is lost with too few selected features, while some noise is included with too many selected features. (2) The MAP of SFSL is higher than that of the other methods with the same number of selected features, especially when the number of selected features is smaller than that at the peak. (3) MAP reaches its peak with 150 features for SFSL and FSL2p, and with 200 features for FSNM and SFUS. SFSL and FSL2p are based on the l2,1/2-matrix norm, while FSNM and SFUS are based on the l2,1-norm, which indicates that the l2,1/2-matrix norm has better sparsity than the l2,1-norm for feature selection. (4) The MAP of SFSL is higher than that of FSL2p, and the MAP of SFUS is higher than that of FSNM. SFSL and SFUS use shared subspace learning, while FSL2p and FSNM do not, which indicates that shared subspace learning can improve the performance of feature selection. To sum up, this experiment indicates that our method SFSL achieves the best sparse feature selection performance with fewer selected features based on the l2,1/2-matrix norm and shared subspace learning, and thus enhances the web image annotation performance.
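The experiments above vary the number of selected features k. The paper does not state the selection rule explicitly; the usual convention for l2,p-based methods, and the one assumed in this sketch, is to rank features by the l2-norm of the corresponding rows of G and keep the top k:

import numpy as np

def select_features(G, k):
    """Indices of the k features with the largest row norms ||g^i||_2 of G."""
    scores = np.linalg.norm(G, axis=1)
    return np.argsort(-scores)[:k]

# Example: keep 150 features and build the reduced training matrix.
# idx = select_features(G, 150); X_reduced = X[idx, :]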

4.3. Convergence analysis

As proved above, the objective function value in (11) monotonically decreases until convergence using the iterative approach of Algorithm 1. Here we conduct experiments on the two datasets to show this convergence, and we compare the convergence of our method SFSL with that of the other methods SFUS, FSNM and FSL2p. In these experiments, 20 × c training data of each dataset are used and the number of selected features k is set to 150. We set the regularization parameters λ and μ to 0.01 and 10 respectively on the NUS-WIDE dataset, and to 0.01 and 0.001 respectively on the MSRA-MM 2.0 dataset. These experiments are conducted on a personal computer (2.90 GHz, 4 GB RAM) with Matlab (2010b). The convergence curves are shown in Fig. 3.

Fig. 3 shows the convergence curves of our method SFSL and of SFUS, FSNM and FSL2p. For the NUS-WIDE dataset, it can be observed from Fig. 3a–d that the objective function value of our method SFSL converges within 3 iterations, while the objective function values of the other methods converge within 5 to 12 iterations. For the MSRA dataset, we can observe from Fig. 3e–h that the objective function value of our method SFSL converges within 3 iterations, while the objective function values of the other methods converge within 5 to 10 iterations. From Fig. 3 we can conclude that the convergence of our method SFSL has been confirmed experimentally. Meanwhile, the experimental results illustrate that our method SFSL converges faster than SFUS, FSNM and FSL2p, which indicates that SFSL is more efficient in selecting the discriminative features and is thus capable of annotating large-scale web images.

4.4. Influence of parameter p

The l2,1/2-matrix norm used in our method is a special case of the l2,p-matrix norm. In order to demonstrate the effect of the parameter p and show that our method SFSL has the best performance, typical values of p are tested in this experiment. We set p to 0.1, 0.25, 0.5, 0.75 and 1 respectively in the optimization problem. Following the above experiment, 20 × c training data are used on both datasets. The number of selected features k is set to 50, 100, 150, 200, 250 and 300 respectively. The regularization parameter λ is set to 0.01 and μ to 10 on the NUS-WIDE dataset, while λ is set to 0.01 and μ to 0.001 on the MSRA-MM 2.0 dataset. Tables 2 and 3 show the annotation results with different values of p on the NUS-WIDE dataset and the MSRA-MM 2.0 dataset. The results in bold indicate the best performance for the corresponding number of selected features.

Table 2. Performance comparison with different parameter p (NUS-WIDE dataset).

MAP     p=0.1   p=0.25  p=0.5   p=0.75  p=1
d=50    0.082   0.094   0.109   0.095   0.100
d=100   0.083   0.095   0.110   0.093   0.102
d=150   0.088   0.100   0.112   0.094   0.106
d=200   0.084   0.097   0.108   0.090   0.108
d=250   0.083   0.093   0.106   0.086   0.105
d=300   0.081   0.092   0.104   0.084   0.103

Table 3. Performance comparison with different parameter p (MSRA-MM dataset).

MAP     p=0.1   p=0.25  p=0.5   p=0.75  p=1
d=50    0.031   0.038   0.070   0.058   0.053
d=100   0.032   0.040   0.071   0.060   0.058
d=150   0.034   0.044   0.072   0.063   0.065
d=200   0.034   0.050   0.071   0.062   0.069
d=250   0.033   0.042   0.070   0.064   0.069
d=300   0.032   0.041   0.070   0.062   0.068


This experiment indicates that different values of p result in different annotation performance. From Tables 2 and 3 we have the following observations. (1) No matter how many features are selected, the best performance is achieved when p is 0.5. (2) When p belongs to (0, 0.5), the closer p is to 0, the worse the performance; this indicates that the representation becomes so sparse that much useful information is lost. (3) When p is close to 1, the model is almost SFUS. In a word, the results of this experiment indicate that SFSL (p = 0.5) can select the most discriminative features and has the best ability to annotate large-scale web images.

4.5. Regularization parameters analysis

There are two regularization parameters λ and μ in (11). An experiment on parameter sensitivity is conducted to learn how they affect the web image annotation performance. The two regularization parameters are each varied over 0.001, 0.01, 0.1, 1, 10, 100 and 1000. In this experiment, 20 × c training data are used and the number of selected features k is set to 150 on both datasets. Fig. 4 shows the MAP variation with λ and μ on the two datasets.

From Fig. 4 we observe that the annotation performance changes with different combinations of λ and μ. When λ = 0.01 and μ = 10, the annotation performance is best on the NUS-WIDE dataset, while when λ = 0.01 and μ = 0.001 the annotation performance is best on the MSRA-MM 2.0 dataset. The annotation performance becomes poor with other combinations of λ and μ. Therefore, in the other experiments of this paper, we set the regularization parameters λ and μ to 0.01 and 10 respectively on the NUS-WIDE dataset, and to 0.01 and 0.001 respectively on the MSRA-MM 2.0 dataset.

Fig. 4. MAP variation with λ and μ on the two datasets. The figure shows that different parameters λ and μ result in different annotation results. (a) NUS-WIDE dataset and (b) MSRA dataset.

5. Conclusion

In this paper we have proposed a novel sparse feature selection model, SFSL, for web image annotation. The SFSL model can select sparser and more discriminative features with good robustness based on the l2,1/2-matrix norm sparse model with shared subspace learning. We have introduced an effective algorithm for optimizing the objective function of SFSL and have proven the convergence of the algorithm. Experiments are conducted on two popular web image datasets. The experimental results demonstrate that our algorithm outperforms state-of-the-art sparse feature selection algorithms in both effectiveness and efficiency, which indicates that our method is suitable for large-scale web image annotation.

Acknowledgments

This work was supported partly by the National Natural Science Foundation of China (61172128, 61003114), the National Key Basic Research Program of China (2012CB316304), the Fundamental Research Funds for the Central Universities (2013JBM020, 2013JBZ003), the Program for Innovative Research Team in University of the Ministry of Education of China (IRT201206) and the Doctoral Foundation of the China Ministry of Education (20120009120009).

Appendix. Proof of Theorem 1

Proof. According to Algorithm 1, it can be inferred from (12) that

$$G_{t+1} = \arg\min_{G} \; \mathrm{Tr}((X^T G - Y)^T \tilde{D}_t (X^T G - Y)) + \lambda\,\mathrm{Tr}(G^T D_t G) + \mu \|G - UL\|_F^2 \quad \mathrm{s.t.}\; U^T U = I.$$

Therefore, we have

$$\mathrm{Tr}((X^T G_{t+1} - Y)^T \tilde{D}_t (X^T G_{t+1} - Y)) + \lambda\,\mathrm{Tr}(G_{t+1}^T D_t G_{t+1}) + \mu \|G_{t+1} - U_{t+1} L_{t+1}\|_F^2 \le \mathrm{Tr}((X^T G_t - Y)^T \tilde{D}_t (X^T G_t - Y)) + \lambda\,\mathrm{Tr}(G_t^T D_t G_t) + \mu \|G_t - U_t L_t\|_F^2.$$

Since the diagonal elements of $\tilde{D}_t$ and $D_t$ are built from the $t$th iterate, the traces can be written row-wise: with $z_t^i = x_i^T G_t - y_i$, $\mathrm{Tr}(Z_t^T \tilde{D}_t Z_t) = \sum_{i=1}^{n} \|z_t^i\|_2^2 / (4\|z_t^i\|_2^{3/2})$ and $\mathrm{Tr}(Z_{t+1}^T \tilde{D}_t Z_{t+1}) = \sum_{i=1}^{n} \|z_{t+1}^i\|_2^2 / (4\|z_t^i\|_2^{3/2})$, and similarly for the $D_t$ terms. Hence the above inequality is equivalent to

$$\sum_{i=1}^{n} \frac{\|x_i^T G_{t+1} - y_i\|_2^2}{4\|x_i^T G_t - y_i\|_2^{3/2}} + \lambda \sum_{i=1}^{d} \frac{\|g_{t+1}^i\|_2^2}{4\|g_t^i\|_2^{3/2}} + \mu\|G_{t+1} - U_{t+1} L_{t+1}\|_F^2 \le \sum_{i=1}^{n} \frac{\|x_i^T G_t - y_i\|_2^2}{4\|x_i^T G_t - y_i\|_2^{3/2}} + \lambda \sum_{i=1}^{d} \frac{\|g_t^i\|_2^2}{4\|g_t^i\|_2^{3/2}} + \mu\|G_t - U_t L_t\|_F^2.$$

Adding and subtracting $\sum_{i=1}^{n}\|x_i^T G_{t+1} - y_i\|_2^{1/2}$ and $\lambda\sum_{i=1}^{d}\|g_{t+1}^i\|_2^{1/2}$ on the left-hand side (and the corresponding terms for $G_t$ on the right-hand side) and rearranging, we obtain

$$\sum_{i=1}^{n}\|x_i^T G_{t+1} - y_i\|_2^{1/2} + \lambda \sum_{i=1}^{d}\|g_{t+1}^i\|_2^{1/2} + \mu\|G_{t+1} - U_{t+1}L_{t+1}\|_F^2 - \Big(\sum_{i=1}^{n}\|x_i^T G_{t+1} - y_i\|_2^{1/2} - \sum_{i=1}^{n}\frac{\|x_i^T G_{t+1} - y_i\|_2^2}{4\|x_i^T G_t - y_i\|_2^{3/2}}\Big) - \lambda\Big(\sum_{i=1}^{d}\|g_{t+1}^i\|_2^{1/2} - \sum_{i=1}^{d}\frac{\|g_{t+1}^i\|_2^2}{4\|g_t^i\|_2^{3/2}}\Big)$$
$$\le \sum_{i=1}^{n}\|x_i^T G_t - y_i\|_2^{1/2} + \lambda \sum_{i=1}^{d}\|g_t^i\|_2^{1/2} + \mu\|G_t - U_t L_t\|_F^2 - \Big(\sum_{i=1}^{n}\|x_i^T G_t - y_i\|_2^{1/2} - \sum_{i=1}^{n}\frac{\|x_i^T G_t - y_i\|_2^2}{4\|x_i^T G_t - y_i\|_2^{3/2}}\Big) - \lambda\Big(\sum_{i=1}^{d}\|g_t^i\|_2^{1/2} - \sum_{i=1}^{d}\frac{\|g_t^i\|_2^2}{4\|g_t^i\|_2^{3/2}}\Big).$$

It has been proven in Lemma 2 that

$$\sum_{i=1}^{d}\Big(\|g_{t+1}^i\|_2^{1/2} - \frac{\|g_{t+1}^i\|_2^2}{4\|g_t^i\|_2^{3/2}}\Big) \le \sum_{i=1}^{d}\Big(\|g_t^i\|_2^{1/2} - \frac{\|g_t^i\|_2^2}{4\|g_t^i\|_2^{3/2}}\Big),$$

and the same argument applies to the loss terms $\|x_i^T G - y_i\|_2$. Therefore

$$\sum_{i=1}^{n}\|x_i^T G_{t+1} - y_i\|_2^{1/2} + \lambda \sum_{i=1}^{d}\|g_{t+1}^i\|_2^{1/2} + \mu\|G_{t+1} - U_{t+1}L_{t+1}\|_F^2 \le \sum_{i=1}^{n}\|x_i^T G_t - y_i\|_2^{1/2} + \lambda \sum_{i=1}^{d}\|g_t^i\|_2^{1/2} + \mu\|G_t - U_t L_t\|_F^2,$$

that is,

$$\|X^T G_{t+1} - Y\|_{2,1/2}^{1/2} + \lambda\|G_{t+1}\|_{2,1/2}^{1/2} + \mu\|G_{t+1} - U_{t+1}L_{t+1}\|_F^2 \le \|X^T G_t - Y\|_{2,1/2}^{1/2} + \lambda\|G_t\|_{2,1/2}^{1/2} + \mu\|G_t - U_t L_t\|_F^2.$$

This shows that the objective function value of Eq. (11) monotonically decreases until it converges to the optimal $G$ with Algorithm 1. □

References

[1] J.H. Tang, R.C. Hong, S.C. Yan, T.S. Chua, Image annotation by kNN-sparse graph-based label propagation over noisily-tagged web images, ACM Trans. Intell. Syst. Technol. 1 (1) (2010) 111–126.

[2] Y. Yang, F. Wu, F.P. Nie, H.T. Shen, Y.T. Zhuang, A.G. Hauptmann, Web and personal image annotation by mining label correlation with relaxed visual graph embedding, IEEE Trans. Image Process. 21 (3) (2012) 1339–1351.

[3] Y. Yang, Z. Huang, Y. Yang, J.J. Liu, H.T. Shen, J. Luo, Local image tagging via graph regularized joint group sparsity, Pattern Recogn. 46 (2013) 1358–1368.

[4] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley-Interscience, New York, USA, 2001.

[5] I. Kononenko, Estimating attributes: analysis and extensions of RELIEF, in: Proceedings of ECML, 1994, pp. 171–182.

[6] F. Wu, Y. Yuan, Y. Zhuang, Heterogeneous feature selection by group lasso with logistic regression, in: Proceedings of ACM Multimedia, 2010, pp. 983–986.

[7] F. Wu, Y. Yuan, Y. Rui, S. Yan, Y. Zhuang, Annotating web images using NOVA: non-convex group sparsity, in: Proceedings of ACM Multimedia, 2012, pp. 509–518.

[8] Z.G. Ma, Y. Yang, F.P. Nie, J. Uijlings, N. Sebe, Exploiting the entire feature space with sparsity for automatic image annotation, in: Proceedings of ACM Multimedia, 2011, pp. 283–292.

[9] Z.G. Ma, F.P. Nie, Y. Yang, J. Uijlings, N. Sebe, A. Hauptmann, Discriminating joint feature analysis for multimedia data understanding, IEEE Trans. Multimedia 14 (6) (2012) 1662–1672.

[10] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. 58 (1996) 267–288.

[11] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: Proceedings of CVPR, 2009, pp. 1794–1801.

[12] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, in: Proceedings of ACM SIGKDD, 2010, pp. 333–342.

[13] S. Foucart, M.J. Lai, Sparsest solutions of underdetermined linear systems via lq-minimization for 0<q<1, Appl. Comput. Harmonic Anal. 26 (3) (2009) 395–407.

[14] D. Krishnan, R. Fergus, Fast image deconvolution using hyper-Laplacian priors, in: Neural Information Processing Systems, MIT Press, Cambridge, MA, 2009.

[15] R. Chartrand, Exact reconstruction of sparse signals via nonconvex minimization, IEEE Signal Process. Lett. 14 (10) (2007) 707–710.

[16] R. Chartrand, Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data, in: Proceedings of the IEEE International Symposium on Biomedical Imaging, 2009, pp. 262–265.

[17] S. Guo, Z.H. Wang, Q.Q. Ruan, Enhancing sparsity via lp (0<p<1) minimization for robust face recognition, Neurocomputing 99 (2013) 592–602.

[18] Z.B. Xu, X.Y. Chang, F.M. Xu, H. Zhang, L1/2 regularization: a thresholding representation theory and a fast solver, IEEE Trans. Neural Netw. Learn. 23 (7) (2012) 1013–1027.

[19] Z.B. Xu, H. Zhang, Y. Wang, X.Y. Chang, Y. Liang, L1/2 regularizer, Sci. China 53 (6) (2010) 1159–1169.

[20] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint L2,1-norms minimization, in: Proceedings of NIPS, 2010, pp. 1813–1821.

[21] Z.G. Ma, F.P. Nie, Y. Yang, J.R.R. Uijlings, N. Sebe, Web image annotation via subspace-sparsity collaborated feature selection, IEEE Trans. Multimedia 14 (4) (2012) 1021–1030.

[22] Z.C. Li, J. Liu, Y. Yang, X.F. Zhou, H.Q. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. 26 (9) (2014) 2138–2150.

[23] Z.C. Li, Y. Yang, J. Liu, X.F. Zhou, H.Q. Lu, Unsupervised feature selection using nonnegative spectral analysis, in: Proceedings of AAAI, 2012, pp. 1026–1032.

[24] L.P. Wang, S.C. Chen, l2,p-Matrix norm and its application in feature selection, 2013.

[25] R. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, J. Mach. Learn. Res. 6 (2005) 1817–1853.

[26] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image dataset from National University of Singapore, in: Proceedings of CIVR, 2009, pp. 1–9.

[27] H. Li, M. Wang, X. Hua, MSRA-MM 2.0: a large-scale web multimedia dataset, in: Proceedings of ICDMW, 2009, pp. 164–169.

[28] S. Ji, L. Tang, S. Yu, J. Ye, A shared-subspace learning framework for multi-label classification, ACM Trans. Knowl. Discovery Data 4 (2) (2010) 1–29.

[29] Y. Amit, M. Fink, N. Srebro, S. Ullman, Uncovering shared structures in multiclass classification, in: Proceedings of ICML, 2007, pp. 17–24.

[30] J. Chen, L. Tang, J. Liu, J. Ye, A convex formulation for learning shared structures from multiple tasks, in: Proceedings of ICML, 2009, pp. 137–144.

[31] G. Cawley, N. Talbot, M. Girolami, Sparse multinomial logistic regression via Bayesian L1 regularisation, in: Proceedings of NIPS, 2006, pp. 209–216.

Caijuan Shi received the B.S. degree in electronic information engineering from Jilin University, Changchun, China, in 2000, and the M.S. degree in signal and information processing from Tianjin University, Tianjin, China, in 2007. She is currently a Ph.D. student in the Institute of Information Science, Beijing Jiaotong University, Beijing, China. Her research interests include multimedia analysis, computer vision and machine learning.

Qiuqi Ruan received the B.S. and M.S. degrees from Northern Jiaotong University in 1969 and 1981 respectively. From January 1987 to May 1990, he was a visiting scholar at the University of Pittsburgh and the University of Cincinnati. He has published 4 books and more than 100 papers. He is now a professor and doctoral supervisor in the Institute of Information Science, Beijing Jiaotong University, Beijing, China. He is a senior member of IEEE. His main research interests include digital signal processing, computer vision, pattern recognition, and virtual reality.


Song Guo received the B.S. degree in biomedical engineering from Beijing Jiaotong University, Beijing, in 2007. He is currently a Ph.D. student in the Institute of Information Science, Beijing Jiaotong University, Beijing, China. His research interests include pattern recognition, image processing and machine learning.

Yi Tian received the B.S. degree in biomedical engineering from Beijing Jiaotong University, Beijing, in 2011. She is currently a Ph.D. student in the Institute of Information Science, Beijing Jiaotong University, Beijing, China. Her research interests include pattern recognition, image processing and machine learning.
