
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, APRIL 2012

An Unsupervised Feature Selection Framework for Social Media Data

Jiliang Tang, Huan Liu, Fellow, IEEE

Abstract—The explosive usage of social media produces massive amounts of unlabeled and high-dimensional data. Feature selection has been proven effective in dealing with high-dimensional data for efficient learning and data mining. Unsupervised feature selection remains a challenging task due to the absence of label information, based on which feature relevance is often assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection; e.g., social media data is inherently linked, which invalidates the independent and identically distributed assumption and brings new challenges to unsupervised feature selection algorithms. In this paper, we investigate a novel problem of feature selection for social media data in an unsupervised scenario. In particular, we analyze the differences between social media data and traditional attribute-value data, investigate how the relations extracted from linked data can be exploited to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We design and conduct systematic experiments to evaluate the proposed framework on datasets from real-world social media websites. The empirical study demonstrates the effectiveness and potential of our proposed framework.

Index Terms—Unsupervised Feature Selection, Linked data, Social Media, Pseudo Labels, Social Dimension Regularization

1 INTRODUCTION

In recent years, the rapid emergence of social media services such as Facebook and Twitter allows more and more users to participate in online social activities such as posting blogs or microblogs, uploading photos, and connecting with other like-minded users. The explosive popularity of social media produces massive data at an unprecedented speed. For example, 250 million tweets are posted per day¹; 3,000 photos are uploaded per minute to Flickr; and the number of Facebook users has increased from 100 million in 2008 to 800 million in 2011². The massive and high-dimensional social media data challenges traditional data mining tasks such as classification and clustering due to the curse of dimensionality and scalability issues. One traditional and effective approach to handling high-dimensional data is feature selection [8], [20], which aims to select a subset of relevant features from the high-dimensional feature space that minimizes redundancy and maximizes relevance to the targets (e.g., class labels). Feature selection helps improve the performance of learning models by alleviating the curse of dimensionality, speeding up the learning process, and improving the generalization capability of a learning model [18], [21].

Feature selection algorithms are broadly divided into supervised and unsupervised feature selection according to whether the training data is labeled or unlabeled. In supervised feature selection, the relevance of features is assessed by their capability of distinguishing different classes.

• J. Tang and H. Liu are with the Department of Computer Science, Arizona State University, Tempe, AZ, 85281. E-mail: {Jiliang.Tang, Huan.Liu}@asu.edu

• A shorter version of this submission was published in the proceedings of KDD 2012 with the title "Unsupervised Feature Selection for Linked Social Media Data" [33].

1. http://techcrunch.com/2011/06/30/twitter-3200-million-tweets/

2. http://en.wikipedia.org/wiki/Facebook

It is time-consuming and costly to obtain labeled data, and given the scale of unlabeled data in social media, we propose to study unsupervised feature selection. Without label information, unsupervised feature selection is particularly difficult because the definition of feature relevance becomes unclear [11], [10], [4], [1]. Furthermore, with high-dimensional data, it is likely to find many sets of features that seem equally good without considering additional constraints [11], [10]. Most existing feature selection algorithms work only with attribute-value data, while social media data is inherently linked, adding further challenges to feature selection.

As we know, linked data provides link information beyond attribute-value data. Figure 1 shows a simple example of linked data in social media and its two data representations. Figure 1(a) shows 8 linked instances (u1 to u8). These instances usually form groups, and instances within groups have more connections than instances between groups [36]. For example, u1 has more connections to {u2, u3, u4} than to {u5, u6, u7, u8}. Figure 1(b) is a conventional representation of attribute-value data (or instance-feature matrix): rows are instances and columns are features. For linked data, in addition to the conventional representation, there is link information between instances as shown in Figure 1(c) (or instance-instance matrix), which, in general, indicates the correlations among instances [35]. The availability of link information presents unprecedented opportunities to advance research on feature selection.

Linked data in social media presents both challenges and new opportunities for unsupervised feature selection. In this paper, we study a novel problem of unsupervised feature selection for linked data. In particular, we investigate: (1) how to exploit and model the relations among data instances, and (2) how


Fig. 1. A Simple Example of Linked Social Media Data: (a) Linked Data; (b) Attribute-Value Data; (c) Attribute-Value and Linked Data.

to take advantage of these relations for feature selection in an unsupervised scenario. One key difference between supervised and unsupervised feature selection is the availability of label information. If we change our perspective and consider label information as a sort of constraint in the learning space, we can turn both types of feature selection into a problem with the same intrinsic property: selecting features to be consistent with some constraints [39]. In supervised learning, label information plays the role of the constraint. Without labels, alternative constraints such as data variance and separability can be used for unsupervised learning [30], [8], [5], [16], [39], [4]. However, most of these methods evaluate the importance of a feature individually and neglect feature correlation [40], [38]. Additionally, without considering additional constraints, these methods are likely to find many sets of features that seem equally good [20]. The availability of link information can provide more constraints for unsupervised feature selection and can potentially improve its performance. Furthermore, social theories have been developed by sociologists to explain the formation of links in social media. For example, social correlation theories suggest that linked instances are likely to be similar. These social theories can help us exploit and model link information.

To address the challenges of unsupervised feature selection for linked social media data, we propose a novel framework of Linked Unsupervised Feature Selection (LUFS). Our main contributions are summarized below:

• Introducing the concept of pseudo-class labels, enabling us to extract constraints from linked data, i.e., link information and attribute-value information, for unsupervised feature selection;

• Developing two approaches to exploit the individual and group behaviors of linked instances via graph regularization and social dimension regularization, respectively;

• Proposing a novel unsupervised feature selection framework, LUFS, for linked data in social media to exploit link information in selecting features; and

• Evaluating LUFS extensively using datasets from real-world social media websites to understand the working of LUFS.

The rest of this paper is organized as follows. We formally define the problem of unsupervised feature selection for linked data in social media in Section 2; introduce our new framework of unsupervised feature selection, LUFS, in Section 3, including approaches to capture link information, optimization, and convergence analysis; present empirical evaluation with discussion in Section 4; review related work in Section 5; and conclude this work in Section 6.

2 PROBLEM STATEMENT

In this paper, scalars are denoted by lower-case letters (a, b, . . . ; α, β, . . .), vectors are written as lower-case bold letters (a, b, . . .), and matrices correspond to boldfaced upper-case letters (A, B, . . .). A(i, :) and A(:, j) denote the i-th row and j-th column of A, respectively.

Let p = {p_1, p_2, . . . , p_n} be the set of instances in social media (e.g., u1 to u8 in Figure 1), where n is the number of data instances, and let f = {f_1, f_2, . . . , f_m} be the set of features, where m is the number of features. Let X = (x_1, x_2, . . . , x_n) ∈ R^{m×n} be the conventional (matrix) representation of p w.r.t. f, where x_i(j) is the feature value of f_j in p_i (e.g., Figure 1(b)). We assume that the data X is centered, that is, $\sum_{i=1}^{n} x_i = 0$, which can be realized as X = XP where $P = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$.

For social media data, there also exist connections among its data instances in the form of link information. Let R ∈ R^{n×n} be their link representation, where R(i, j) = 1 if p_i and p_j are linked and zero otherwise (e.g., the right subgraph in Figure 1(c)). In this work, we assume that the relations among data instances are undirected, i.e., R = R⊤; however, the extension to directed relations is straightforward.

With the above notations defined, the problem of unsupervised feature selection for linked data can be stated as: given n linked data instances, their conventional representation X and their link representation R, develop a method F that selects a subset of relevant features f′ of size k from f by utilizing both X and R for these n linked instances in an unsupervised scenario, formally stated as
$$\mathcal{F}: \{f; X, R\} \rightarrow \{f'\}, \qquad (1)$$
while the traditional task of unsupervised feature selection is formally defined as
$$\mathcal{F}: \{f; X\} \rightarrow \{f'\}. \qquad (2)$$


Our proposed task substantially differs from the traditional unsupervised feature selection problem: linked data provides link information beyond attribute-value data, and in general, link information reveals the correlation between instances, enabling advanced research on feature selection.

3 UNSUPERVISED FEATURE SELECTION FRAMEWORK: LUFS

Let $s = \pi(\underbrace{0,\ldots,0}_{m-k}, \underbrace{1,\ldots,1}_{k})$, where π(·) is a permutation function and k is the number of features to select; s(i) = 1 indicates that the i-th feature is selected. The original data with k selected features can be represented as diag(s)X, where diag(s) is a diagonal matrix. The difficulty faced in unsupervised feature selection is the lack of class labels. Hence, we introduce the concept of pseudo-class labels to guide unsupervised feature selection. We assume that there is a mapping matrix W ∈ R^{m×c}, which assigns each data instance a pseudo-class label, where c is the number of pseudo-class labels. The pseudo-class label indicator matrix is Y = W⊤diag(s)X ∈ R^{c×n}, where Y(j, i) indicates the likelihood of the i-th instance belonging to the j-th class. Following the widely adopted constraints on the cluster indicator matrix, we add orthogonality constraints on Y. Since X is centered, it is easy to verify that Y is also centered:
$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \big(W^\top \mathrm{diag}(s)\, x_i\big) = \big(W^\top \mathrm{diag}(s)\big) \sum_{i=1}^{n} x_i = 0.$$

Both supervised and unsupervised feature selection tasks aim to solve the same problem: selecting features consistent with given constraints. In the supervised setting, label information plays the role of constraints; in the unsupervised setting, pseudo-class label information is tantamount to the provided label information, and we seek pseudo-class label information from linked social media data. In particular, we seek constraints from both the link information R and the attribute-value part X to fit the pseudo-class labels. For each type of information, we provide both an intuitive way and an advanced way to extract constraints for feature selection in this paper. The reasons are two-fold. First, feature selection for linked data is a relatively novel problem; we try to explore it widely to seek a better and deeper understanding of this problem. Second, we try to investigate both intuitive and recently proposed techniques for the studied problem, to help us understand their advantages and disadvantages.

For link information, a widely adopted assumption is that linked instances are likely to share similar labels, which can be explained by social correlation theories such as homophily [25] and social influence [24]. Based on this intuition, we propose graph regularization to capture link information based on social correlation theories. Users in social networks are likely to form groups, and users within groups are similar while users from different groups are dissimilar, which is supported by the social dimension assumption [34]. An advanced way, social dimension regularization based on the social dimension assumption, is also proposed to extract constraints from link information.

For attribute-value information, the intuitive way is spectral analysis, based on the assumption that instances with similar content should have similar labels. Recently, local discriminative analysis has been introduced into feature selection for attribute-value data, and promising results are reported [38]. These motivate us to capture attribute-value information through spectral analysis or discriminative analysis. Considering different combinations of components for capturing link information and attribute-value information, there are four variants of the proposed framework LUFS, i.e., LUFS-SG, LUFS-SSD, LUFS-DG and LUFS-DSD, as shown in Table 1. More details will be introduced in the following subsections.

3.1 Capturing Link Information

In general, linked instances are correlated, which allows us to exploit their correlations for feature selection. In the following subsections, by considering their individual and group behaviors, we introduce graph regularization and social dimension regularization to capture the dependency among linked instances as individuals and as groups, respectively.

3.1.1 Graph Regularization for Link Information

Two linked instances are distinct from two instances of attribute-value data due to the correlations between them [31]. For example, two linked users in Twitter are more likely to have similar interests than two randomly picked users. These correlations can be explained by social correlation theories such as homophily [25] and social influence [24]. Homophily suggests that two instances with similar topics are more likely to be linked, while social influence theory indicates that two linked instances are more likely to have similar topics.

Social correlation theories suggest that two linked instances in social media are more likely to have similar labels, which can be exploited through the pseudo-class labels. For p_i and p_j, if they are linked, we want their labels, i.e., Y(:, i) and Y(:, j), to be similar. Therefore, we force the labels of two linked data instances to be close to each other by minimizing the following graph regularization term,

$$\frac{1}{2}\sum_{i}\sum_{j} R(i,j)\,\|Y(:,i) - Y(:,j)\|_2^2 = \frac{1}{2}\sum_{k}\sum_{i}\sum_{j} R(i,j)\,\big(Y(k,i) - Y(k,j)\big)^2 = \sum_{k} Y(k,:)\,(D_R - R)\,Y(k,:)^\top = \mathrm{Tr}(Y L_R Y^\top), \qquad (3)$$


TABLE 1
Four Variants of LUFS

                                                    Link Information
                                               Graph Regularization (G)   Social Dimension Regularization (SD)
Attribute-Value   Spectral Analysis (S)        LUFS-SG                    LUFS-SSD
Information       Discriminative Analysis (D)  LUFS-DG                    LUFS-DSD

where $L_R = D_R - R$ is the Laplacian matrix based on the link information R, and $D_R$ is a diagonal matrix with $D_R(i,i) = \sum_{j} R(j,i)$.

3.1.2 Social Dimensions for Link Information

In [34], social dimension is introduced as a means to integrate the interdependency among linked data and attribute-value data. Social dimensions can capture the group behaviors of linked instances: instances from different social dimensions are dissimilar, while instances in the same social dimension are similar. By flattening linked data with social dimensions, traditional classification methods such as SVM obtain better performance for relational learning [34].

Social dimension extraction is a well-studied problem in the social network analysis community [26], [12], [34]. In this work, we adopt a widely used algorithm, Modularity Maximization [26], which can be formulated as:
$$\max_{H^\top H = I} \mathrm{Tr}\big(H^\top M H\big), \qquad (4)$$
where H ∈ R^{n×K} is the social dimension indicator matrix, K is the number of social dimensions, and M is the modularity matrix defined as:
$$M = R - \frac{d d^\top}{2n}, \qquad (5)$$
where d is the degree vector and d_i is the degree of p_i. It is easy to verify that the optimal H consists of the top-K eigenvectors of the modularity matrix M. This H is continuous, with mixed positive and negative values. As in spectral clustering, the standard K-means algorithm is then performed on H to obtain the discrete-valued social dimension assignment: H(i, j) = 1 if p_i is in the j-th dimension, and H(i, j) = 0 otherwise.
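A minimal sketch of this extraction step, assuming NumPy and scikit-learn are available; it follows the modularity matrix as defined in Eq. (5) above and discretizes with K-means, and is an illustration rather than the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def extract_social_dimensions(R, K):
    """Extract K social dimensions from a symmetric link matrix R (Eqs. (4)-(5))."""
    d = R.sum(axis=1)                                  # degree vector
    M = R - np.outer(d, d) / (2.0 * R.shape[0])        # modularity matrix, Eq. (5)
    # Top-K eigenvectors of M (largest eigenvalues) give the continuous H.
    eigvals, eigvecs = np.linalg.eigh(M)
    H_cont = eigvecs[:, np.argsort(eigvals)[::-1][:K]]
    # Discretize with K-means, as in spectral clustering.
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(H_cont)
    H = np.zeros((R.shape[0], K))
    H[np.arange(R.shape[0]), labels] = 1.0             # discrete indicator matrix
    return H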

Recall the simple example of linked data in Figure 1: we can extract two social dimensions, i.e., {u1, u2, u3, u4} and {u5, u6, u7, u8}. From Figure 1, we can observe that instances in the same social dimension have more connections with each other and fewer connections across social dimensions. The discrete-valued social dimension indicator matrix H is:
$$H^\top = \begin{pmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{pmatrix}.$$

Since social dimensions represent a type of user affiliation, inspired by Linear Discriminant Analysis, we define three matrices, the within-, between-, and total social dimension scatter matrices S_w, S_b, and S_t, as follows:
$$S_w = YY^\top - YFF^\top Y^\top, \qquad S_b = YFF^\top Y^\top, \qquad S_t = YY^\top, \qquad (6)$$
where S_w + S_b = S_t. F is the weighted social dimension indicator matrix, which can be obtained from H as
$$F = H(H^\top H)^{-\frac{1}{2}}, \qquad (7)$$

where the j-th column of F is given by
$$F_j = \frac{(0,\ldots,0,\overbrace{1,\ldots,1}^{h_j},0,\ldots,0)^\top}{\sqrt{h_j}}, \qquad (8)$$
where h_j is the number of instances in the j-th social dimension.

Instances in the same social dimension are similar, and instances from different social dimensions are dissimilar. We can therefore obtain a constraint from link information, social dimension regularization, to model the relations among linked instances, by maximizing the following term,

$$\mathrm{Tr}\big((S_t)^{-1} S_b\big) = \mathrm{Tr}\big((YY^\top)^{-1} YFF^\top Y^\top\big). \qquad (9)$$
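For illustration, a small NumPy sketch (assuming a discrete indicator H as above and an invertible total scatter; it is not the released implementation) of the weighted indicator F of Eq. (7) and the regularization term of Eq. (9):

import numpy as np

def social_dimension_regularization(Y, H):
    """Compute Tr((Y Y^T)^{-1} Y F F^T Y^T) of Eq. (9) from pseudo-labels Y (c x n)
    and a discrete social dimension indicator H (n x K)."""
    sizes = H.sum(axis=0)                       # h_j: instances per social dimension
    F = H / np.sqrt(sizes)                      # F = H (H^T H)^{-1/2}, Eq. (7)
    St = Y @ Y.T                                # total scatter, Eq. (6)
    Sb = Y @ F @ F.T @ Y.T                      # between scatter, Eq. (6)
    return np.trace(np.linalg.solve(St, Sb))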

3.2 Capturing Attribute-Value Information

With the pseudo-class labels, we are further allowed to capture information from the attribute-value part in a supervised manner. In the following subsections, we introduce two methods to exploit attribute-value information for feature selection based on two recently developed techniques, i.e., spectral analysis [23] and discriminative analysis [38].

3.2.1 Spectral Analysis for Attribute-Value Information

Spectral analysis indicates that similar data instances should have similar labels. Assume that S ∈ R^{n×n} is the similarity matrix based on X, obtained through an RBF kernel in this work as
$$S(i,j) = e^{-\frac{\|x_i - x_j\|^2}{\sigma^2}}. \qquad (10)$$
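A small sketch of this RBF similarity construction (the bandwidth σ is a hypothetical choice; this is illustrative only):

import numpy as np

def rbf_similarity(X, sigma=1.0):
    """S(i, j) = exp(-||x_i - x_j||^2 / sigma^2) for X in R^{m x n} (columns are instances)."""
    sq_norms = np.sum(X ** 2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)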

We force the pseudo-class labels of similar data instances to be close to each other. Therefore, the constraint from the attribute-value part can be formulated


to minimize the following term,
$$\frac{1}{2}\sum_{i}\sum_{j} S(i,j)\,\|Y(:,i) - Y(:,j)\|_2^2 = \frac{1}{2}\sum_{k}\sum_{i}\sum_{j} S(i,j)\,\big(Y(k,i) - Y(k,j)\big)^2 = \sum_{k} Y(k,:)\,(D_S - S)\,Y(k,:)^\top = \mathrm{Tr}(Y L_S Y^\top), \qquad (11)$$

where $L_S = D_S - S$ is a Laplacian matrix and $D_S$ is a diagonal matrix with its elements defined as $D_S(i,i) = \sum_{j=1}^{n} S(i,j)$.

3.2.2 Discriminative Analysis for Attribute-Value Information

Another advantage of introducing the pseudo-class labels is that they allow us to exploit local discriminative information [38] from the attribute-value part. For each data point p_j, we obtain its K′ nearest neighbors, denoted as U^j = {j_1, j_2, . . . , j_{K′}}. Let N^j = [x_j, x_{j_1}, . . . , x_{j_{K′}}] and Y^j = [Y(:, j), Y(:, j_1), . . . , Y(:, j_{K′})] be the local data and local label matrices, respectively. It is easy to verify that Y^j = Y Z^j, where Z^j ∈ {0, 1}^{n×(K′+1)} is defined as
$$Z^j(p,q) = \begin{cases} 1 & \text{if } p = \bar{U}^j(q), \\ 0 & \text{otherwise}, \end{cases} \qquad (12)$$

where $\bar{U}^j = \{j, U^j\}$. Inspired by Linear Discriminant Analysis, the local total scatter matrix T^j and the local between-class scatter matrix B^j for x_j are defined as
$$T^j = N^j P (N^j)^\top, \qquad B^j = N^j P (Y^j)^\top Y^j P (N^j)^\top, \qquad (13)$$
where $P = I - \frac{1}{K'+1}\mathbf{1}_{K'+1}\mathbf{1}_{K'+1}^\top \in \mathbb{R}^{(K'+1)\times(K'+1)}$. It is easy to verify that P = P⊤ = P⊤P. The local discriminative information for p_j captures the discriminative information among p_j and its neighbors. With the definitions of T^j and B^j, and analogous to Linear Discriminant Analysis, the local discriminative score DS_j for x_j is defined as
$$DS_j = \mathrm{Tr}\big((T^j + \eta I)^{-1} B^j\big), \qquad (14)$$

where ηI is added to make T^j + ηI invertible. A larger DS_j suggests a higher discriminative ability w.r.t. the datum x_j. Therefore, the local discriminative information of the attribute-value part can be captured by seeking the highest discriminative scores for all data instances, i.e., by maximizing $\sum_{j=1}^{n} DS_j$. After some derivations, this maximization problem can be converted into the following minimization problem [38],
$$\min_{W} \mathrm{Tr}(Y G Y^\top), \quad \text{s.t.} \; YY^\top = I, \qquad (15)$$

where G is defined as
$$G = \sum_{j=1}^{n} Z^j P \big(N^j P (N^j)^\top + \eta I\big)^{-1} P (Z^j)^\top. \qquad (16)$$
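To make Eqs. (12)–(14) concrete, here is a small NumPy sketch (illustrative only; the neighborhood size K′ and regularizer η are hypothetical choices) that computes the local discriminative score DS_j for one instance:

import numpy as np

def local_discriminative_score(X, Y, j, K_prime=5, eta=1e-3):
    """DS_j of Eq. (14) for instance j, given X (m x n) and pseudo-labels Y (c x n)."""
    m, n = X.shape
    # K' nearest neighbors of x_j (excluding j itself), by Euclidean distance.
    dists = np.linalg.norm(X - X[:, [j]], axis=0)
    neighbors = np.argsort(dists)[1:K_prime + 1]
    idx = np.concatenate(([j], neighbors))          # \bar{U}^j = {j, U^j}
    Nj = X[:, idx]                                  # local data matrix N^j
    Yj = Y[:, idx]                                  # local label matrix Y^j
    P = np.eye(K_prime + 1) - np.ones((K_prime + 1, K_prime + 1)) / (K_prime + 1)
    Tj = Nj @ P @ Nj.T                              # local total scatter, Eq. (13)
    Bj = Nj @ P @ Yj.T @ Yj @ P @ Nj.T              # local between-class scatter, Eq. (13)
    return np.trace(np.linalg.solve(Tj + eta * np.eye(m), Bj))   # Eq. (14)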

3.3 Unsupervised Feature Selection Framework

By introducing the concept of pseudo-class labels, constraints from both link information and unlabeled attribute-value data are ready for unsupervised feature selection. Our framework of linked unsupervised feature selection, LUFS, is conceptually shown in Figure 2, with our solutions to the two challenges (the need to take linked data into account and the lack of labels): extracting constraints from both linked and attribute-value data to fit the introduced pseudo-class labels. Considering different combinations of components for capturing link information and attribute-value information, we propose four variants of LUFS, i.e., LUFS-SG, LUFS-SSD, LUFS-DG and LUFS-DSD, as shown in Table 1. The optimization problems for these four variants are stated in Table 2.

Since YY⊤ = I, S_t in social dimension regularization is an identity matrix and Tr(YY⊤) = c is a constant. Social dimension regularization can therefore be rewritten as a minimization problem,
$$\min \; \mathrm{Tr}(YY^\top) - \mathrm{Tr}(YFF^\top Y^\top). \qquad (17)$$
The reason for adding the constant to social dimension regularization will be explained later.

The constraint on s makes the problems in Table 2 mixed integer programs [3], which are still difficult to solve. We observe that diag(s) and W appear in Y only in the form diag(s)W. Since s is a binary vector and m − k rows of diag(s) are all zeros, diag(s)W is a matrix in which many rows are all zeros. This motivates us to absorb diag(s) into W, i.e., W ← diag(s)W, and to add an ℓ2,1-norm penalty on W to ensure the sparsity of W in rows and achieve feature selection. The problems for LUFS-SG, LUFS-SSD, LUFS-DG and LUFS-DSD after relaxation are shown in Table 3.

Since X is high dimensional, XX⊤ might be singular. We add λI to XX⊤ to make XX⊤ + λI nonsingular; this setting is common in practice, e.g., in standard linear regression and linear discriminant analysis. ‖W‖2,1 controls the capacity of W and also ensures that W is sparse in rows, making it particularly suitable for feature selection. ‖W‖2,1 is the ℓ2,1-norm of W, defined as [6],
$$\|W\|_{2,1} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{c} W^2(i,j)} = \sum_{i=1}^{m} \|W(i,:)\|_2. \qquad (18)$$
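A one-line check of Eq. (18) in NumPy (illustrative):

import numpy as np

def l21_norm(W):
    """||W||_{2,1}: sum of the Euclidean norms of the rows of W, Eq. (18)."""
    return np.sum(np.linalg.norm(W, axis=1))

W = np.random.randn(8, 3)
assert np.isclose(l21_norm(W), sum(np.sqrt(np.sum(W[i] ** 2)) for i in range(8)))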

Next, we develop a theorem showing that LUFS-SG, LUFS-SSD, LUFS-DG and LUFS-DSD can be converted into a single unified form after some direct derivations, beginning with the following lemma.

Lemma 1. Assume E ∈ R^{n×m} is a matrix with orthonormal columns, that is, E⊤E = I_m; then I_n − EE⊤ is a symmetric and positive semi-definite matrix.

Proof: It is easy to verify that I_n − EE⊤ is symmetric since (I_n − EE⊤)⊤ = I_n − EE⊤. Let E = UΣV⊤ be the Singular Value Decomposition (SVD) of E, where U ∈ R^{n×n} and V ∈ R^{m×m} are


Fig. 2. Illustration of the LUFS Framework: the pseudo-class label indicator Y = W⊤diag(s)X (mapping matrix W, feature selection matrix diag(s), attribute-value information X) is fit with constraints extracted from link information and from attribute-value information.

TABLE 2
Optimization Problems for the Four Variants of LUFS

LUFS-SG:   min_{W,s}  Tr(Y L_S Y⊤) + α Tr(Y L_R Y⊤)          s.t. s ∈ {0,1}^m, s⊤1_m = k, YY⊤ = I
LUFS-SSD:  min_{W,s}  Tr(Y L_S Y⊤) − α Tr((S_t)^{−1} S_b)    s.t. s ∈ {0,1}^m, s⊤1_m = k, YY⊤ = I
LUFS-DG:   min_{W,s}  Tr(Y G Y⊤) + α Tr(Y L_R Y⊤)            s.t. s ∈ {0,1}^m, s⊤1_m = k, YY⊤ = I
LUFS-DSD:  min_{W,s}  Tr(Y G Y⊤) − α Tr((S_t)^{−1} S_b)      s.t. s ∈ {0,1}^m, s⊤1_m = k, YY⊤ = I

TABLE 3
Optimization Problems for the Four Variants of LUFS after Relaxation

LUFS-SG:   min_W  Tr(W⊤ X L_S X⊤ W) + β‖W‖_{2,1} + α Tr(W⊤ X L_R X⊤ W)             s.t. W⊤(XX⊤ + λI)W = I_c
LUFS-SSD:  min_W  Tr(W⊤ X L_S X⊤ W) + β‖W‖_{2,1} + α Tr(W⊤ X (I_n − FF⊤) X⊤ W)     s.t. W⊤(XX⊤ + λI)W = I_c
LUFS-DG:   min_W  Tr(W⊤ X G X⊤ W) + β‖W‖_{2,1} + α Tr(W⊤ X L_R X⊤ W)               s.t. W⊤(XX⊤ + λI)W = I_c
LUFS-DSD:  min_W  Tr(W⊤ X G X⊤ W) + β‖W‖_{2,1} + α Tr(W⊤ X (I_n − FF⊤) X⊤ W)       s.t. W⊤(XX⊤ + λI)W = I_c

orthogonal, and Σ = diag(σ_1, . . . , σ_m, 0, . . . , 0) is diagonal with σ_1 ≥ σ_2 ≥ . . . ≥ σ_m ≥ 0. Then we have
$$E^\top E = I_m \;\Rightarrow\; V\Sigma^\top\Sigma V^\top = VV^\top \;\Rightarrow\; 1 \ge \sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_m \ge 0. \qquad (19)$$
Then,
$$I_n - EE^\top = U(I_n - \Sigma\Sigma^\top)U^\top. \qquad (20)$$
Since I_n − ΣΣ⊤ is a diagonal matrix whose diagonal elements are all nonnegative, it is positive semi-definite. Thus U(I_n − ΣΣ⊤)U⊤ is positive semi-definite, which completes the proof.

The importance of Lemma 1 is two-fold. First, it explains why we need the constant in social dimension regularization: adding the constant guarantees the convexity of the objective functions for LUFS-SSD and LUFS-DSD. Second, it paves the way for the following theorem for LUFS.

Theorem 3.1: The problems in Table 3 can be converted into the following unified optimization problem,
$$\min_{W} f(W) = \mathrm{Tr}(W^\top A W) + \beta \|W\|_{2,1}, \quad \text{s.t.} \; W^\top B W = I_c, \qquad (21)$$
where A is a symmetric and positive semi-definite matrix and B is a symmetric and positive definite matrix.

Proof: We use LUFS-SSD as an example to demonstrate the proof. According to the above analysis, LUFS-SSD solves the following minimization problem,
$$\min_{W} \mathrm{Tr}(W^\top X L_S X^\top W) + \beta \|W\|_{2,1} + \alpha \mathrm{Tr}\big(W^\top X (I_n - FF^\top) X^\top W\big), \quad \text{s.t.} \; W^\top (XX^\top + \lambda I) W = I_c. \qquad (22)$$
It suffices to show how to construct a symmetric positive semi-definite matrix A and a symmetric positive definite matrix B from Eq. (22) for LUFS-SSD.

We construct B = XX⊤ + λI. It is easy to check that B is symmetric and positive definite when λ > 0, and the constraint in Eq. (22) then becomes the one in Eq. (21).

We construct A = X L_S X⊤ + αX(I_n − FF⊤)X⊤, and the objective function in Eq. (22) then becomes the one in Eq. (21). According to Lemma 1, (I_n − FF⊤) is symmetric and positive semi-definite; and according to the definition of the Laplacian matrix, X L_S X⊤ is symmetric and positive semi-definite. Thus, A is symmetric and positive semi-definite.

A similar process can be applied to LUFS-SG, LUFS-DG and LUFS-DSD. To save space, we omit the


details and summarize the definitions of A and B for LUFS-SG, LUFS-DG and LUFS-DSD as follows:

LUFS-SG:   A = X L_S X⊤ + α X L_R X⊤,            B = XX⊤ + λI,
LUFS-DG:   A = X G X⊤ + α X L_R X⊤,              B = XX⊤ + λI,
LUFS-DSD:  A = X G X⊤ + α X (I_n − FF⊤) X⊤,      B = XX⊤ + λI,

which completes the proof.

3.4 Optimization Algorithm for LUFS

In recent years, many methods have been proposed to solve the ℓ2,1-norm minimization problem [22], [27], [38]. However, our problem differs from these existing ones due to the orthogonality constraint in Eq. (21). Hence, we propose the following procedure, shown in Algorithm 1, to optimize the problem in Eq. (21).

Algorithm 1 LUFS
Input: {X, R, α, β, λ, c, K, k}
Output: k most relevant features
1: Construct A according to the components chosen for LUFS
2: Set B = XX⊤ + λI
3: Set t = 0 and initialize D_0 as an identity matrix
4: while not convergent do
5:   Set C_t = B^{−1}(A + βD_t)
6:   Set W_t = [q_1, . . . , q_c], where q_1, . . . , q_c are the eigenvectors of C_t corresponding to the c smallest eigenvalues
7:   Update the diagonal matrix D_{t+1}, whose i-th diagonal element is 1 / (2‖W_t(i, :)‖_2)
8:   Set t = t + 1
9: end while
10: Sort the features according to ‖W(i, :)‖_2 in descending order and select the top-k ranked ones

Next we briefly review Algorithm 1. In line 1, we construct A according to the components chosen for LUFS, referring to Theorem 3.1. For example, for LUFS-SSD we take the following steps to construct A:

• Obtain the social dimension indicator matrix H;
• Set F = H(H⊤H)^{−1/2};
• Construct S through the RBF kernel and set L_S = D_S − S;
• Construct A = X L_S X⊤ + αX(I_n − FF⊤)X⊤.

The iterative algorithm to optimize Eq. (21) is presented in lines 4 to 9. The convergence analysis of the algorithm starts with the following two lemmas.
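A minimal NumPy/SciPy sketch of this iterative procedure (lines 4–9 of Algorithm 1) is given below; it assumes A and B have already been constructed as above, and it uses a generalized symmetric eigensolver (SciPy ≥ 1.5 for subset_by_index) in place of explicitly forming B^{−1}(A + βD_t), which yields the same eigenvectors. It is an illustration, not the authors' released code.

import numpy as np
from scipy.linalg import eigh

def lufs_optimize(A, B, c, beta, n_iter=50, eps=1e-8):
    """Iteratively solve Eq. (21): min Tr(W^T A W) + beta*||W||_{2,1}, s.t. W^T B W = I_c."""
    m = A.shape[0]
    D = np.eye(m)                                   # D_0 initialized as identity (line 3)
    W = None
    for _ in range(n_iter):                         # lines 4-9
        # Eigenvectors of B^{-1}(A + beta*D) for the c smallest eigenvalues (lines 5-6),
        # obtained from the generalized problem (A + beta*D) w = lambda * B w.
        _, vecs = eigh(A + beta * D, B, subset_by_index=[0, c - 1])
        W = vecs
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))   # line 7 (eps guards zero rows)
    return W

def rank_features(W, k):
    """Line 10: sort features by ||W(i, :)||_2 in descending order and keep the top k."""
    return np.argsort(-np.linalg.norm(W, axis=1))[:k]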

Lemma 2. Let A ∈ R^{m×m} be symmetric and positive semi-definite and B ∈ R^{m×m} be symmetric and positive definite. If W ∈ R^{m×c} solves the following minimization problem,
$$\min_{W} \mathrm{Tr}(W^\top A W), \quad \text{s.t.} \; W^\top B W = I_c, \qquad (23)$$
then W consists of the eigenvectors of B^{−1}A corresponding to the c smallest eigenvalues.

Proof: Since B is symmetric and positive definite, let B = UΣU⊤ be the Singular Value Decomposition (SVD) of B. It follows that B = UΣ^{1/2}Σ^{1/2}U⊤ = EE⊤ where E = UΣ^{1/2}. The objective function in Eq. (23) can be rewritten as
$$\mathrm{Tr}(W^\top A W) = \mathrm{Tr}(W^\top E E^{-1} A (E^{-1})^\top E^\top W) = \mathrm{Tr}\big((W^\top E)\, E^{-1} A (E^{-1})^\top (E^\top W)\big). \qquad (24)$$
Set $\widetilde{W} = E^\top W$; then $\widetilde{W}^\top \widetilde{W} = W^\top B W = I_c$, and Eq. (23) can be rewritten as
$$\min_{\widetilde{W}} \mathrm{Tr}\big(\widetilde{W}^\top E^{-1} A (E^{-1})^\top \widetilde{W}\big), \quad \text{s.t.} \; \widetilde{W}^\top \widetilde{W} = I_c. \qquad (25)$$
According to the Ky Fan theorem [17], the optimal $\widetilde{W}$ consists of the eigenvectors of $E^{-1}A(E^{-1})^\top$ corresponding to the c smallest eigenvalues. Let g_i be the i-th eigenvector of $E^{-1}A(E^{-1})^\top$ corresponding to the i-th smallest eigenvalue; then
$$E^\top W = [g_1, g_2, \ldots, g_c] \;\Rightarrow\; W = (E^{-1})^\top [g_1, g_2, \ldots, g_c]. \qquad (26)$$
It is easy to verify that $(E^{-1})^\top g_i$ is the i-th eigenvector of B^{−1}A corresponding to the i-th smallest eigenvalue. Thus the optimal W consists of the eigenvectors of B^{−1}A corresponding to the c smallest eigenvalues, which completes the proof.

Lemma 3. The following inequality holds if $w_t^i|_{i=1}^{r}$ are non-zero vectors, where r is an arbitrary number:
$$\sum_i \|w_{t+1}^i\|_2 - \sum_i \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \;\le\; \sum_i \|w_t^i\|_2 - \sum_i \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}. \qquad (27)$$

Proof: According to [27], [31], the following inequality holds for any positive constants a and b:
$$\sqrt{a} - \frac{a}{2\sqrt{b}} \le \sqrt{b} - \frac{b}{2\sqrt{b}}. \qquad (28)$$
Substituting $\|w_{t+1}^i\|_2^2$ and $\|w_t^i\|_2^2$ for a and b in Eq. (28), respectively, we see that the following inequality holds for any i:
$$\|w_{t+1}^i\|_2 - \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} \le \|w_t^i\|_2 - \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2}. \qquad (29)$$
By summing Eq. (29) over i, Eq. (27) holds, which completes the proof.

With the above two lemmas, we develop the following theorem regarding the convergence of Algorithm 1.

Theorem 3.2: At each iteration of Algorithm 1, the value of the objective function in Eq. (21) monotonically decreases.


Proof: According to Lemma 2, W_{t+1} obtained in line 6 of Algorithm 1 is the solution to the following problem,
$$W_{t+1} = \arg\min_{W^\top B W = I} \mathrm{Tr}\big(W^\top (A + \beta D_t) W\big), \qquad (30)$$
which indicates that
$$\mathrm{Tr}\big(W_{t+1}^\top (A + \beta D_t) W_{t+1}\big) \le \mathrm{Tr}\big(W_t^\top (A + \beta D_t) W_t\big).$$
That is to say,
$$\mathrm{Tr}(W_{t+1}^\top A W_{t+1}) + \beta \sum_i \frac{\|W_{t+1}(i,:)\|_2^2}{2\|W_t(i,:)\|_2} \le \mathrm{Tr}(W_t^\top A W_t) + \beta \sum_i \frac{\|W_t(i,:)\|_2^2}{2\|W_t(i,:)\|_2}. \qquad (31)$$
Then we have the following inequality:
$$\mathrm{Tr}(W_{t+1}^\top A W_{t+1}) + \beta \sum_i \|W_{t+1}(i,:)\|_2 - \beta\Big(\sum_i \|W_{t+1}(i,:)\|_2 - \sum_i \frac{\|W_{t+1}(i,:)\|_2^2}{2\|W_t(i,:)\|_2}\Big) \le \mathrm{Tr}(W_t^\top A W_t) + \beta \sum_i \|W_t(i,:)\|_2 - \beta\Big(\sum_i \|W_t(i,:)\|_2 - \sum_i \frac{\|W_t(i,:)\|_2^2}{2\|W_t(i,:)\|_2}\Big).$$
Meanwhile, according to Lemma 3, we have
$$\sum_i \|W_{t+1}(i,:)\|_2 - \sum_i \frac{\|W_{t+1}(i,:)\|_2^2}{2\|W_t(i,:)\|_2} \le \sum_i \|W_t(i,:)\|_2 - \sum_i \frac{\|W_t(i,:)\|_2^2}{2\|W_t(i,:)\|_2}.$$
Therefore, we have the following inequality:
$$\mathrm{Tr}(W_{t+1}^\top A W_{t+1}) + \beta\|W_{t+1}\|_{2,1} \le \mathrm{Tr}(W_t^\top A W_t) + \beta\|W_t\|_{2,1}, \qquad (32)$$
which completes the proof.

According to Theorem 3.2, Algorithm 1 converges to the optimal W for the problem in Eq. (21). Next, we analyze the time complexity of Algorithm 1.

Time Complexity: There are three main computations in Algorithm 1: computing A, computing B, and computing the c eigenvectors of C_t. For all four variants of LUFS, the time complexities of computing B and of computing the c eigenvectors of C_t are the same:

• The major operation in constructing B is XX⊤, whose time complexity is about O(n²m).
• The most time-consuming operation in updating W is obtaining the c eigenvectors of C_t ∈ R^{m×m} corresponding to the c smallest eigenvalues. The time complexity of this operation, achieved by the Lanczos algorithm, is O(cm²).

However, the time complexities of computing A differ across variants. The computation of A contains two parts. The first part is to compute X L_S X⊤, X G X⊤, X L_R X⊤ or X(I_n − FF⊤)X⊤; its time complexity is the same for all variants, O(n²m). The second part is to calculate L_S, G, L_R or I_n − FF⊤, which needs O(n²m), O(K′n²), O(n) (the link information R is very sparse) and O(Kn² + cKn + cn²), respectively. We can conclude that for link information, the time complexity of graph regularization, O(n), is lower than that of social dimension regularization, O(Kn² + cKn + cn²); while for attribute-value information, the time complexity of discriminative analysis, O(K′n²), is lower than that of spectral analysis, O(n²m). Therefore, the cost of computing the second part for the four variants satisfies LUFS-SSD > LUFS-SG > LUFS-DSD > LUFS-DG.

TABLE 4
Statistics of the Datasets

                          BlogCatalog    Flickr
Size                      5,198          7,575
# of Features             8,189          12,047
# of Classes              6              9
# of Links                27,965         47,344
Ave Degree                5.38           6.25
Clustering Coefficient    0.1224         0.3301

4 EXPERIMENTS AND DISCUSSION

In this section, we present experimental details to verify the effectiveness of the proposed framework, LUFS. After introducing the real-world social media datasets, we first compare the four variants of LUFS and evaluate the effectiveness of LUFS in terms of clustering performance, then study the effects of parameters on performance, and finally further verify the constraint extracted from link information by social dimensions.

4.1 Datasets

We collect two datasets from real-world social media websites, BlogCatalog4 and Flickr5, which are subsets of two publicly available datasets used in [36] to uncover overlapping groups in social media. Both websites allow users to provide tags to indicate their interests, which are considered as the features; users can also subscribe to interest groups, which are used as the ground-truth class labels in this work. In addition, users can follow other users if they share similar interests, forming the link information. We compute the number of links for each instance, and the distributions are shown in Figure 3. The first observation is that most instances have a few links, while a few instances have an extremely high number of links. These distributions suggest a power-law distribution, which is typical in social networks. Some statistics of the datasets are shown in Table 4: Flickr has a denser network, indicated by its higher average degree and higher clustering coefficient. Note that "# of Classes" in Table 4 denotes the number of ground-truth classes, and the ground-truth classes are only used for evaluation purposes.

4. http://www.blogcatalog.com
5. http://www.flickr.com/


Fig. 3. Link Distribution: number of instances vs. number of links (log-log scale) for (a) BlogCatalog and (b) Flickr.

4.2 Baseline Methods and Metrics

LUFS is compared with the following three representative unsupervised feature selection algorithms: (1) UDFS [38], which selects features in batch mode by simultaneously exploiting discriminative information and feature correlation; (2) Laplacian Score [16], which evaluates the importance of a feature through its power of locality preservation; and (3) SPEC [39], which selects features using spectral regression. Note that all these baseline methods have the same time complexity as the proposed framework, O(mn²).

Following the current practice of evaluating unsupervised feature selection, we assess LUFS in terms of clustering performance. The clustering quality is evaluated by two commonly used metrics, accuracy and normalized mutual information (NMI)6.

We vary the number of selected features as {200, 300, 400, 500, 600, 700, 800, 900, 1000}. Each feature selection algorithm is first performed to select features, and then K-means clustering is performed on the selected features. Since K-means often converges to local minima, we repeat each experiment 20 times and report the average performance.
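For concreteness, a sketch of this evaluation protocol with scikit-learn and SciPy (hypothetical variable names; the feature scores would come from the ‖W(i, :)‖_2 ranking of Algorithm 1). Clustering accuracy is computed with the usual best cluster-to-class matching via the Hungarian algorithm, which the paper does not spell out; this is an assumption about the standard protocol rather than the authors' exact script.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best match between cluster labels and classes via the Hungarian algorithm."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, cl in enumerate(clusters):
        for j, c in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == cl) & (y_true == c))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / len(y_true)

def evaluate(X, feature_scores, y_true, n_selected, n_classes, n_repeats=20):
    """Select top-ranked features, run K-means, and report average accuracy and NMI."""
    selected = np.argsort(-feature_scores)[:n_selected]
    X_sel = X[selected, :].T                       # instances as rows for KMeans
    accs, nmis = [], []
    for seed in range(n_repeats):                  # K-means can hit local minima; repeat
        y_pred = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(X_sel)
        accs.append(clustering_accuracy(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return np.mean(accs), np.mean(nmis)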

4.3 Comparison of Four Variants of LUFS

In this section, we compare the quality of the features selected by the four variants of LUFS in terms of clustering performance. The purpose of these experiments is two-fold. First, we want to investigate the impact of different components of LUFS via its four variants. Second, we have shown the efficiency of these four variants via time complexity analysis, and here we want to show their effectiveness. For the parameters of the proposed algorithms, we try different parameter values and report the best performance; more details about parameter analysis are discussed in later subsections. The comparison results are demonstrated in Figures 4 and 5 for Flickr and BlogCatalog, respectively. In general, with the increase of the number of selected features, the performance tends to increase first and then decrease: when the number of selected features is small, we lose too much information, while when it is large, we may introduce noisy features. Algorithms reach their best performance with smaller numbers of selected features in BlogCatalog than in Flickr.

6. We use the source code from http://www.zjucadcg.cn/dengcai/Data/Clustering.html.

Fig. 4. The Comparison of Variants of LUFS in Flickr: (a) Accuracy (%) and (b) NMI vs. the number of selected features, for LUFS-SG, LUFS-SSD, LUFS-DG and LUFS-DSD.

Fig. 5. The Comparison of Variants of LUFS in BlogCatalog: (a) Accuracy (%) and (b) NMI vs. the number of selected features, for LUFS-SG, LUFS-SSD, LUFS-DG and LUFS-DSD.

For example, most algorithms achieve their best performance in terms of NMI with 200 selected features in BlogCatalog, compared to 700 in Flickr.

We have the following observations:

• LUFS-SG always obtains better performance than LUFS-DG, and LUFS-SSD outperforms LUFS-DSD. For example, in BlogCatalog, on average, LUFS-SG and LUFS-SSD obtain 6.54% and 7.33% relative improvement in terms of accuracy compared to LUFS-DG and LUFS-DSD, respectively. When fixing the component for capturing link information, among the components for capturing attribute-value information, spectral analysis (S) always obtains better performance than discriminative analysis (D).

• LUFS-SSD always outperforms LUFS-SG, and most of the time, LUFS-DSD achieves better performance than LUFS-DG. For instance, in Flickr, the average improvement of LUFS-SSD and LUFS-DSD over LUFS-SG and LUFS-DG with respect to NMI is 2.59% and 6.78%, respectively. When fixing the component for capturing attribute-value information, social dimension regularization is superior to graph regularization in terms of capturing link information. Social dimension regularization captures relations among linked instances as groups, while graph regularization considers relations between two linked instances, indicating that groups are better than individuals in terms of capturing the relations of linked data.

Although LUFS-SSD is not as efficient as the other variants, as shown in the time complexity analysis, we observe that most of the time, LUFS-SSD obtains the best performance among the four


TABLE 5
Clustering Performance with Different Feature Selection Algorithms in Flickr

                        Accuracy (%)                             NMI
# Features     UDFS    LapScore  SPEC    LUFS          UDFS     LapScore  SPEC     LUFS
200            26.29   25.56     26.29   29.19         0.0361   0.0302    0.0361   0.0951
300            26.41   25.79     26.29   29.51         0.0409   0.0555    0.0361   0.1026
400            27.02   25.98     26.29   29.90         0.0702   0.0691    0.0361   0.1489
500            27.29   26.00     26.29   32.70         0.0761   0.0709    0.0361   0.1601
600            27.55   26.00     26.29   30.41         0.0811   0.0662    0.0361   0.1582
700            27.95   26.00     26.29   31.17         0.0921   0.0630    0.0361   0.1701
800            28.41   26.54     26.29   30.48         0.1062   0.0694    0.0361   0.1614
900            31.07   30.90     26.35   31.79         0.1110   0.0816    0.0361   0.1681
1000           30.79   30.73     25.62   31.17         0.1200   0.1056    0.0559   0.1596
12047 (all)    25.49   25.49     25.49   25.49         0.0296   0.0296    0.0296   0.0296

variants of LUFS. For example, compared to the best performance of the other three variants, LUFS-SSD gains 2.59% and 5.92% relative improvement in terms of NMI in Flickr and BlogCatalog, respectively. Therefore, in the following experiments, we choose LUFS-SSD for LUFS, i.e., we choose spectral analysis to capture attribute-value information and social dimension regularization to exploit link information.

4.4 Quality of Selected Features

In this subsection, we compare the quality of features selected by different algorithms using the performance metrics given above. For the baseline methods with parameters, we try different parameter values and report the best performance. The resulting parameter values for LUFS are {α = 0.1, β = 0.1, K = 70, c = 9} for Flickr and {α = 0.1, β = 0.1, K = 10, c = 6} for BlogCatalog. λ is used to make XX⊤ + λI nonsingular, and based on empirical experience, we set λ to 0.01 in both datasets. The comparison results for Flickr and BlogCatalog are shown in Tables 5 and 6, respectively. Note that the clustering performance with all features (i.e., without feature selection) is reported in the last rows.

We observe the performance change with the number of selected features: it increases, reaches a peak, and then decreases. For example, LUFS achieves its peak values when the numbers of selected features are 500 and 300 for Flickr and BlogCatalog, respectively. The clustering performance with as few as 200 features is better than that with all features. For instance, LUFS obtains 10.51% and 18.68% relative improvement in terms of accuracy for Flickr and BlogCatalog, respectively. These results demonstrate that the number of features can be significantly reduced without performance deterioration.

LapScore obtains comparable results to SPEC on both datasets. Most of the time, UDFS outperforms both LapScore and SPEC, which is consistent with the results reported in [38]. LapScore and SPEC analyze features separately and select features one after another, which may miss possible correlations between different features, while UDFS selects features in batch mode and considers feature correlation. These observations support the conclusion in [4], [40], [38] that it is recommended to analyze features jointly for feature selection. LUFS always obtains the best

TABLE 7
Improvement of LUFS w.r.t. Performance of Clustering

               Flickr (%)               BlogCatalog (%)
# Features     Accuracy   NMI           Accuracy   NMI
200            11.03      163.43        11.44      16.67
300            11.74      84.86         7.40       17.38
400            10.66      112.11        6.98       13.25
500            19.82      110.38        4.90       15.65
600            10.38      95.07         6.19       14.22
700            11.52      84.69         7.12       16.75
800            7.29       51.98         7.17       14.33
900            2.32       51.44         4.19       13.22
1000           1.24       33.00         3.65       13.03

performance, and the relative improvement of LUFS compared to the best performance of the baseline methods is shown in Table 7. For each number of selected features, we conduct a t-test to compare the performance of LUFS and the best performance of the baseline methods at significance level 0.01, and these t-test results consistently show that LUFS performs significantly better than the baseline methods.

4.5 Parameter Analysis

We use LUFS-SSD as an example to illustrate parameter selection for LUFS, since LUFS-SSD always obtains the best performance. In addition to the number of selected features (determining which remains an open problem [38]), LUFS has four important parameters: the number of pseudo-class labels (c), the number of social dimensions (K), α (controlling social dimension regularization) and β (controlling the ℓ2,1-norm regularization). Hence, we study the effect of each of the four parameters (c, K, α, or β) by fixing the other three, and observe how the performance of LUFS varies with the number of selected features. The processes of parameter selection for Flickr and for BlogCatalog are similar, and we present the details for Flickr to save space. Examples of experimental results are presented next.

The number of pseudo-class labels c is varied from 2 to 20 with an incremental step of 1 while setting {α = 0.1, β = 0.1, K = 70}. The performance variation w.r.t. c and the number of features is depicted in Figure 6. We note that when c varies from 5 to 11, the clustering performance is not sensitive to c, especially when the number of selected features is large.


TABLE 6
Clustering Performance with Different Feature Selection Algorithms in BlogCatalog

                        Accuracy (%)                             NMI
# Features     UDFS    LapScore  SPEC    LUFS          UDFS     LapScore  SPEC     LUFS
200            41.16   41.07     40.33   45.87         0.1842   0.1328    0.1819   0.2149
300            51.86   49.83     51.16   55.70         0.1807   0.1755    0.1789   0.2121
400            49.28   48.93     48.55   52.72         0.1789   0.1707    0.1768   0.2026
500            49.01   48.59     48.66   51.41         0.1764   0.1746    0.1726   0.2040
600            47.96   47.75     47.43   50.93         0.1741   0.1753    0.1765   0.2016
700            47.31   46.84     46.93   50.68         0.1725   0.1757    0.1688   0.2014
800            46.98   46.69     46.73   50.35         0.1744   0.1730    0.1742   0.1994
900            46.98   46.76     46.60   48.95         0.1702   0.1675    0.1675   0.1927
1000           45.36   45.09     45.54   47.20         0.1673   0.1615    0.1654   0.1891
8189 (all)     38.65   38.65     38.65   38.65         0.1540   0.1540    0.1540   0.1540

Fig. 6. Number of Pseudo-class Labels vs. Number of Features: (a) Accuracy (%); (b) NMI.

Fixing c = 9, α = 0.1 and β = 0.1, we vary the number of social dimensions from 10 to 100 with an incremental step of 10; the performance variation w.r.t. the number of social dimensions and the number of features is demonstrated in Figure 7. Most of the time, with an increasing number of social dimensions, the performance first increases, reaches its peak value, and then degrades. Based on the social dimension assumption, labels of instances from the same social dimension are likely to be similar, while labels of instances from different social dimensions are likely to be dissimilar. When too few social dimensions are extracted, instances with different labels are likely to be mixed in the same social dimension; when too many social dimensions are extracted, with high probability, instances with the same label might be

Fig. 7. Number of Social Dimensions vs. Number of Features: (a) Accuracy (%); (b) NMI.

Fig. 8. α vs. Number of Features: (a) Accuracy (%); (b) NMI.

divided into several social dimensions. This pattern can be used to determine the optimal number of social dimensions for LUFS.

Fixing c = 9, K = 70 and β = 0.1, we vary α over {1e−6, 1e−3, 1e−2, 0.1, 0.3, 0.5, 0.7, 1}. The performance variation w.r.t. α and the number of features is depicted in Figure 8. The performance first increases, and most of the time, the peak values for both accuracy and NMI are achieved when α = 0.1, indicating the importance of social dimension regularization for LUFS. After α = 0.5, the performance degrades dramatically, suggesting that link information alone is not enough for LUFS.

To study how β and the number of features affect the performance, we vary β over {1e−6, 1e−3, 1e−2, 0.1, 0.3, 0.5, 0.7, 1} and set c = 9, K = 70 and α = 0.1. The results are shown in Figure 9. We observe that the performance improves as β changes from 1e−3 to 1e−2 and from 1e−2 to 0.1. These results demonstrate the capability of the


Fig. 9. β vs. Number of Features: (a) Accuracy (%), (b) NMI.

Fig. 10. Within and Between Distance Achieved by Social Dimension and Random Assignment in Flickr: (a) Dwithin, (b) Dbetween.

Among these four parameters of LUFS, β is the most sensitive; the number of pseudo-class labels, the number of social dimensions, and α are less so.

4.6 Probing Further

A key contribution of LUFS to the performance improvement is its exploitation of link information. Social dimension regularization (SDR) consistently outperforms graph regularization in terms of capturing link information. SDR employs social dimensions extracted from linked data. Hence, we would like to probe further why the use of social dimensions works. One way is to investigate whether instances in the same social dimension are similar and instances from different social dimensions are dissimilar.

Let z = {z1, z2, . . . , zK} be the K social dimensions given by the social dimension extraction algorithm, i.e., Modularity Maximization [26] here, with sizes {n1, n2, . . . , nK}, where ni is the number of instances in the i-th social dimension and ∑i ni = n.
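The exact extraction procedure follows the cited work on modularity-based community detection [26]; as a rough, illustrative sketch only, one common spectral realization (in the spirit of latent social dimensions [34]) eigen-decomposes the modularity matrix and clusters its leading eigenvectors with k-means. The function name and the k-means post-processing step below are our own assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def social_dimensions(A, K, random_state=0):
    """Partition n linked instances into K social dimensions via a spectral
    relaxation of modularity maximization: eigen-decompose the modularity
    matrix B = A - d d^T / (2m) and run k-means on the top-K eigenvectors."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # degrees
    B = A - np.outer(d, d) / d.sum()       # modularity matrix (d.sum() = 2m)
    vals, vecs = np.linalg.eigh(B)
    top = vecs[:, np.argsort(vals)[::-1][:K]]
    return KMeans(n_clusters=K, n_init=10,
                  random_state=random_state).fit_predict(top)

# Toy graph with two obvious communities among 8 instances:
A = np.zeros((8, 8))
for i, j in [(0, 1), (0, 2), (1, 2), (0, 3), (4, 5), (5, 6), (6, 7), (4, 7)]:
    A[i, j] = A[j, i] = 1
print(social_dimensions(A, K=2))
```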

Fig. 11. Within and Between Distance Achieved by Social Dimension and Random Assignment in BlogCatalog: (a) Dwithin, (b) Dbetween.

To create reference groups for comparison with social dimensions, we also randomly divide these n instances into K groups with sizes {n1, n2, . . . , nK}. Let z′ = {z′1, z′2, . . . , z′K} be the set of these randomly formed groups, so that each group in z′ (e.g., z′i) corresponds to a social dimension in z (e.g., zi). The label information was used earlier to assess clustering performance; we use it again here. Let Y′ be the class label indicator matrix, centered as Y′ = Y′P. Two distance metrics are defined, Dwithin and Dbetween, for the within- and between-social-dimension distance. They are obtained from the within-social-dimension scatter matrix Sw and the between-social-dimension scatter matrix Sb, respectively,

Dwithin = Tr(Sw), Dbetween = Tr(Sb). (33)
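A minimal sketch of how Dwithin and Dbetween in Eq. (33) can be computed, assuming the standard within-/between-group scatter definitions over the centered label-indicator rows; the helper name and toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def within_between_distance(Y, groups):
    """Return (Dwithin, Dbetween) = (Tr(Sw), Tr(Sb)) for the rows of Y
    (centered label indicators) grouped by social-dimension assignments."""
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)              # center the indicator matrix
    mu = Y.mean(axis=0)                 # global mean (zero after centering)
    groups = np.asarray(groups)
    d_within = d_between = 0.0
    for g in np.unique(groups):
        Yg = Y[groups == g]
        mg = Yg.mean(axis=0)
        d_within += np.sum((Yg - mg) ** 2)               # trace of group scatter
        d_between += Yg.shape[0] * np.sum((mg - mu) ** 2)
    return d_within, d_between

# Toy usage: 6 instances, 3 classes (one-hot labels), 2 groups.
Y = np.eye(3)[[0, 0, 1, 1, 2, 2]]
print(within_between_distance(Y, groups=[0, 0, 0, 1, 1, 1]))
```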

With the increase of the number of social dimensions, Dwithin tends to decrease and Dbetween tends to increase in both datasets. For each specific number of social dimensions K, we calculate Dwithin and Dbetween for z and z′. Varying K from 2 to 100 with an incremental step of 1, we obtain 99 pairs of Dwithin and 99 pairs of Dbetween. The results for Flickr and BlogCatalog are shown in Figure 10 and Figure 11, respectively.

Dwithin and Dbetween change much faster for social dimensions than for groups formed by random assignment. Moreover, Dwithin of a social dimension is much smaller than that of a random assignment group; thus, instances in the same social dimension tend to have similar labels. Dbetween of a social dimension is much larger than that of a random assignment group, indicating that instances from different social dimensions are dissimilar.

We also perform a two-sample t-test on the pairs of Dwithin of z and z′ at a significance level of 0.001. The null hypothesis, H0, is that there is no difference between the pairs; the alternative hypothesis, H1, is that Dwithin of a social dimension is less than that of the corresponding random assignment group. The resulting p-values are 6.0397e−060 and 8.2406e−083 on Flickr and BlogCatalog, respectively. Hence, there is strong evidence to reject the null hypothesis. We conduct a similar test for the pairs of Dbetween, and there is strong evidence that Dbetween of a social dimension is significantly larger than that of its counterpart, a random assignment group. The evidence from both the figures and the t-tests confirms the positive impact of the constraint from link information via social dimensions.
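For concreteness, here is a sketch of such a one-sided test with SciPy. We assume a paired test over the 99 matched pairs (the paper does not spell out whether the test is paired) and SciPy >= 1.6 for the `alternative` argument; the numbers below are synthetic placeholders, not the paper's measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 99 matched pairs of Dwithin (K = 2, ..., 100):
d_sd = rng.normal(400.0, 20.0, size=99)            # social dimensions
d_rand = d_sd + rng.normal(80.0, 10.0, size=99)    # matched random groups

# One-sided paired t-test; H1: Dwithin(social dimension) < Dwithin(random).
t_stat, p_value = stats.ttest_rel(d_sd, d_rand, alternative='less')
print(t_stat, p_value, p_value < 0.001)
```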

5 RELATED WORK

Traditionally, feature selection algorithms can be either supervised or unsupervised [8], [21], depending on whether the training data is labeled or unlabeled. Recent reviews of supervised and unsupervised feature selection can be found in [32] and [1], respectively.

Supervised feature selection methods [21] can be broadly categorized into wrapper models [9], [19] and filter models [15], [28].


The wrapper model uses the predictive accuracy of a predetermined learning algorithm to determine the quality of the selected features. These methods can be prohibitively expensive to run for data with a large number of features [7], [14]. The filter model separates feature selection from classifier learning so that the bias of a learning algorithm does not interact with the bias of a feature selection algorithm. It relies on measures of general characteristics of the training data such as distance, consistency, dependency, information, and correlation [15]. Much attention has also been paid to developing unsupervised feature selection [37], [16], [5]. Unsupervised feature selection [16], [8], [39] is a less constrained search problem without class labels, depends on clustering quality measures [11], [10], and can yield many equally valid feature subsets. With high-dimensional data, it is likely to find many sets of features that seem equally good without considering additional constraints. Another key difficulty is how to objectively measure the results of feature selection. A wrapper model is proposed in [8] that uses a clustering algorithm to evaluate the quality of feature selection.

Recently, sparsity regularization, such as the ℓ2,1-norm of a matrix [6], has been widely investigated in dimensionality reduction and applied to feature selection, including multi-task feature selection [2], [22], robust joint ℓ2,1-norms minimization [27], spectral feature selection [39], and discriminative unsupervised feature selection [38]. Through sparsity regularization, feature selection can be embedded in the learning process.
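To make the role of the ℓ2,1-norm concrete: it sums the Euclidean norms of the rows of a transformation matrix, so penalizing it drives entire rows toward zero, and the surviving row norms yield a feature ranking. The sketch below is illustrative only; W is a toy weight matrix, not one learned by LUFS or the cited methods.

```python
import numpy as np

def l21_norm(W):
    """||W||_{2,1}: the sum of the Euclidean norms of the rows of W."""
    return float(np.sum(np.linalg.norm(np.asarray(W, dtype=float), axis=1)))

def top_features(W, k):
    """Rank features by the row norms of W; rows pushed toward zero by the
    l2,1 penalty correspond to features that can be discarded."""
    scores = np.linalg.norm(np.asarray(W, dtype=float), axis=1)
    return np.argsort(scores)[::-1][:k]

# Toy weight matrix: 5 features x 3 pseudo-classes, two informative rows.
W = np.array([[0.9, 0.8, 0.7],
              [0.0, 0.0, 0.1],
              [1.2, 1.1, 0.9],
              [0.0, 0.1, 0.0],
              [0.2, 0.1, 0.2]])
print(l21_norm(W), top_features(W, k=2))
```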

In [13], the authors proposed a supervised feature selection framework, FSNet, to select features for networked data. It adopts linear regression to fit the content information and graph regularization to capture the link information. The first attempt to select features for social media data is LinkedFS [31], a semi-supervised algorithm. Various relations (coPost, coFollowing, coFollowed, and Following) are extracted following social correlation theories [24], and LinkedFS significantly improves the performance of feature selection by incorporating these relations. LinkedFS is substantially different from FSNet: (1) FSNet is supervised while LinkedFS is semi-supervised; (2) LinkedFS extracts four relations from linked data according to social correlation theories, while FSNet utilizes the network directly, which is a special case of those used by LinkedFS; and (3) the formulations and optimization algorithms of LinkedFS differ from those of FSNet. Our proposed framework LUFS is distinctively different from both LinkedFS and FSNet. First, LinkedFS and FSNet are semi-supervised and supervised methods, respectively, and they use label information, while LUFS is an unsupervised feature selection algorithm. Second, LinkedFS and FSNet exploit relations individually, while LUFS employs relations as groups via social dimensions.

6 CONCLUSION

Linked data in social media presents new challenges to traditional feature selection algorithms, which assume that data instances are independent and identically distributed. In this paper, we study a new problem of selecting features for linked data in social media. We utilize graph regularization and social dimension regularization to capture the individual and group behaviors of linked instances, respectively. In particular, social dimension regularization is based on the recently developed concept of social dimensions from social network analysis, which extracts relations among linked data as groups, and is defined to mathematically model these relations. We then propose the concept of pseudo-class labels to guide the extraction of constraints from link information and attribute-value information, resulting in a new unsupervised feature selection framework, LUFS, for linked social media data. Experimental results on two datasets from real-world social media websites show that the proposed method can effectively exploit link information in comparison with state-of-the-art unsupervised feature selection methods.

There are several interesting directions to be investigated. First, in social media networks, the availability of various kinds of links leads to networks with relationships of different strengths, which means that weak links and strong links are often mixed together. Since strong links indicate strong correlations among instances, treating all links with equal weight will increase the level of noise in the learned models and likely degrade learning performance. One way to further exploit link information is to incorporate tie strength prediction into LUFS.

Second, two characteristics of linked data (i.e., concentrated linkage and autocorrelation) can reduce the effective number of instances available for learning. As users in social media are both content producers and content consumers, the quality of social media content can vary drastically. In other words, social media data consists of both useful and noisy data instances. Therefore, it makes sense to select instances to help better select relevant features, or vice versa.

Third, another characteristic of social media is that its data comes from multiple sources. For example, tweets in Twitter are often related to external sources such as news from BBC; tags and short text descriptions provide semantic information about photos in Flickr. Having multi-source data can compensate for some problems with social media data. Since data from each source can be noisy, partial, or redundant, selecting relevant sources and using them together can help effective feature selection.

Finally, we believe that the concept of pseudo-class labels introduced in this paper is a powerful means to effectively constrain the learning space of unsupervised feature selection, and it can be extended to other applications that lack labeled data but have additional information. This work defines a new research problem of feature selection for social media data and shows its potential and significance, but it only touches the tip of the iceberg of this fertile research area.


Facing the challenges arising from learning and mining social media data, we anticipate many new problems to be researched and a suite of new methods and solutions to be developed.

ACKNOWLEDGMENTS

This project is, in part, supported by National Science Foundation grants #0812551 and IIS-1217466, and ONR grant #N000141410095.

REFERENCES

[1] A. Salem, J. Tang, and H. Liu. Feature selection for clustering: A review. In Data Clustering: Algorithms and Applications, CRC Press, 2013.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, 2007.
[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] D. Cai, C. Zhang, and X. He. Unsupervised feature selection for multi-cluster data. In KDD, 2010.
[5] C. Constantinopoulos, M. Titsias, and A. Likas. Bayesian feature and model selection for Gaussian mixture models. TPAMI, 2006.
[6] C. Ding, D. Zhou, X. He, and H. Zha. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In ICML, 2006.
[7] R. Duda, P. Hart, D. Stork, et al. Pattern Classification, volume 2. Wiley, New York, 2001.
[8] J. Dy and C. Brodley. Feature selection for unsupervised learning. JMLR, 2004.
[9] J. G. Dy and C. E. Brodley. Feature subset selection and order identification for unsupervised learning. In ICML, 2000.
[10] J. G. Dy and C. E. Brodley. Visualization and interactive feature selection for unsupervised data. In KDD, 2000.
[11] J. G. Dy, C. E. Brodley, A. C. Kak, L. S. Broderick, and A. M. Aisen. Unsupervised feature selection applied to content-based retrieval of lung images. TPAMI, 2003.
[12] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. PNAS, 2004.
[13] Q. Gu and J. Han. Towards feature selection in network. In CIKM, 2011.
[14] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 2002.
[15] M. Hall. Correlation-based feature selection for discrete and numeric class machine learning. In ICML, 2000.
[16] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In NIPS, pages 507–514, 2006.
[17] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[18] G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In ICML, 1994.
[19] Y. Kim, W. Street, and F. Menczer. Feature selection for unsupervised learning via evolutionary search. In KDD, 2000.
[20] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall, 2008.
[21] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. TKDE, 2005.
[22] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l2,1-norm minimization. In UAI, 2009.
[23] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[24] P. Marsden and N. Friedkin. Network studies of social influence. Sociological Methods and Research, 22(1):127–151, 1993.
[25] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 2001.
[26] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.
[27] F. Nie, H. Huang, X. Cai, and C. Ding. Efficient and robust feature selection via joint l2,1-norms minimization. In NIPS, pages 1813–1821, 2010.
[28] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. TPAMI, 2005.
[29] M. Plummer and L. Lovasz. Matching Theory. Elsevier, 1986.
[30] V. Roth and T. Lange. Feature selection in clustering problems. In NIPS, 2004.
[31] J. Tang and H. Liu. Feature selection with linked data in social media. In SDM, 2012.
[32] J. Tang, A. Salem, and H. Liu. Feature selection for classification: A review. In Data Classification: Algorithms and Applications, CRC Press, 2014.
[33] J. Tang and H. Liu. Unsupervised feature selection for linked social media data. In KDD, 2012.
[34] L. Tang and H. Liu. Relational learning via latent social dimensions. In KDD, 2009.
[35] B. Taskar, P. Abbeel, M. Wong, and D. Koller. Label and link prediction in relational data. In SRL, 2003.
[36] X. Wang, L. Tang, H. Gao, and H. Liu. Discovering overlapping groups in social media. In ICDM, 2010.
[37] L. Wolf and A. Shashua. Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weighted-based approach. JMLR, 2005.
[38] Y. Yang, H. Shen, Z. Ma, Z. Huang, and X. Zhou. l2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, pages 1589–1594, 2011.
[39] Z. Zhao and H. Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, 2007.
[40] Z. Zhao, L. Wang, and H. Liu. Efficient spectral feature selection with minimum redundancy. In AAAI, 2010.

Jiliang Tang is a senior PhD student in Computer Science and Engineering at Arizona State University. He obtained his Bachelor's degree in Software Engineering and his Master's degree in Computer Science at Beijing Institute of Technology in 2008 and 2010, respectively. His research interests are in computing with online trust, mining social media data, social computing, feature selection, and data mining. He has published innovative work in highly ranked journals and top conference proceedings such as ACM TKDD, DMKD, ACM SIGKDD, WWW, WSDM, IJCAI, SDM, ICDM, and CIKM. Updated information can be found at http://www.public.asu.edu/∼jtang20.

Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science at the University of Southern California and his B.Eng. in Computer Science and Electrical Engineering at Shanghai Jiao Tong University. Before joining ASU, he worked at Telecom Australia Research Labs and was on the faculty at the National University of Singapore. He was recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His research interests are in data mining, machine learning, social computing, and artificial intelligence, investigating problems that arise in many real-world, data-intensive applications with high-dimensional data of disparate forms such as social media. His well-cited publications include books, book chapters, and encyclopedia entries as well as conference and journal papers. He serves on journal editorial boards and numerous conference program committees, and is a founding organizer of the International Conference Series on Social Computing, Behavioral-Cultural Modeling, and Prediction (http://sbp.asu.edu/). He is an IEEE Fellow and an ACM Distinguished Scientist. Updated information can be found at http://www.public.asu.edu/∼huanliu.