Data Mining and Machine Learning Lab

Feature Selection with Linked Data

in Social Media

Jiliang Tang and Huan Liu

Computer Science and Engineering

Arizona State University

April 26-28, 2012 SDM2012

Social Media

• The explosion of social media generates massive data at an unprecedented rate

- 200 million Tweets per day

- 3,000 photos in Flickr per minute

- 153 million blogs posted per year

Social Media Data

• Massive and high-dimensional social media data

poses challenges to data mining tasks

- Scalability

- Curse of dimensionality

• Feature selection is an effective way to prepare

large-scale, high-dimensional data for effective

data mining

Feature Selection

• Traditional feature selection algorithms work with “flat” data (attribute-value data)

- Independent and Identically Distributed (i.i.d.)

• Social media data differs from attribute-value data

- Inherently linked

An Example of Social Media Data

[Figure, shown in four steps: four users (𝑢1 to 𝑢4) and their eight posts (𝑝1 to 𝑝8). The steps highlight, in turn, the users, the posts, the user-post relations, and the user-user following relations.]

Representation for Attribute Value Data

[Figure, shown in three steps: a matrix with one row per post (𝑝1 to 𝑝8), feature columns 𝑓1 … 𝑓𝑚, and label columns 𝑐1 … 𝑐𝑘. The steps highlight, in turn, the posts, the features, and the labels.]

Representation for Social Media Data

[Figure, shown in three steps: the post-by-feature and post-by-label matrices (𝑝1 to 𝑝8; 𝑓1 … 𝑓𝑚; 𝑐1 … 𝑐𝑘) extended with a binary user-post relation matrix and a binary user-user relation matrix over 𝑢1 to 𝑢4. The steps highlight, in turn, the user-post relations, the user-user relations, and the two together as the social context.]

Problem Statement

• Given labeled data X with its label indicator matrix Y, the whole dataset F, and its social context, consisting of user-user following relationships S and user-post relationships P, we aim to select the K most relevant features from the m features of F by exploiting the social context S and P.
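To make the inputs concrete, the following is a minimal sketch of how X, Y, F, S, and P could be laid out as NumPy arrays. The sizes, the random feature values, the class labels, and the following edges are all made up for illustration. The post-to-author grouping (𝑢1: 𝑝1, 𝑝2; 𝑢2: 𝑝3 to 𝑝5; 𝑢3: 𝑝6, 𝑝7; 𝑢4: 𝑝8) is an assumption read off the example figure, and the column-per-post orientation is an assumed convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: m features, N posts, n labeled posts, k classes.
m, N, n, k, n_users = 6, 8, 4, 2, 4

F = rng.random((m, N))            # whole dataset, one column per post
X = F[:, :n]                      # labeled posts (assumed: the first n posts)
labels = np.array([0, 1, 0, 1])   # made-up class index for each labeled post
Y = np.eye(k)[labels]             # n x k label indicator matrix

# User-post relationships P: P[u, p] = 1 if user u wrote post p.
authors = np.array([0, 0, 1, 1, 1, 2, 2, 3])   # author of p1..p8 (assumed)
P = np.zeros((n_users, N))
P[authors, np.arange(N)] = 1

# User-user following relationships S: S[i, j] = 1 if user i follows user j.
S = np.zeros((n_users, n_users))
S[0, 3] = S[2, 3] = 1             # u1 and u3 follow u4 (made-up edges)
S[3, 0] = S[3, 1] = 1             # u4 follows u1 and u2 (made-up edges)
```

Every post has exactly one author here, so each column of P sums to one. Unlike X and Y, the matrices F, S, and P cover unlabeled posts as well.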

Two Fundamental Problems

• Relation extraction

- What distinctive relations can be extracted from linked data?

• Mathematical representation

- How can these relations be used in the feature selection formulation?

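Relation Extraction

[Figure: the example network of users 𝑢1 to 𝑢4 and posts 𝑝1 to 𝑝8, followed by one sub-network per relation type.]

• coPost: a user can have multiple posts

• coFollowing: two users follow a third user (sub-network: 𝑢1, 𝑢3, 𝑢4 with posts 𝑝1, 𝑝2, 𝑝6, 𝑝7, 𝑝8)

• coFollowed: two users are followed by a third user (sub-network: 𝑢1, 𝑢2, 𝑢4 with posts 𝑝1 to 𝑝5 and 𝑝8)

• Following: a user follows another user (sub-network: 𝑢1, 𝑢2 with posts 𝑝1 to 𝑝5)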
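The four relation types can be read off the two relation matrices with a few matrix products. The sketch below is one possible way to do it, not the authors' code. It assumes the convention S[i, j] = 1 when user i follows user j and P[u, p] = 1 when user u wrote post p. The function names and the toy edges are hypothetical.

```python
import numpy as np

def user_relations(S):
    """Boolean user-user matrices for coFollowing, coFollowed, and Following.

    Assumed convention: S[i, j] = 1 means user i follows user j.
    """
    S = (np.asarray(S) > 0).astype(int)
    off_diag = ~np.eye(S.shape[0], dtype=bool)
    co_following = ((S @ S.T) > 0) & off_diag   # i and j follow a common user
    co_followed = ((S.T @ S) > 0) & off_diag    # i and j share a common follower
    following = ((S + S.T) > 0) & off_diag      # i follows j, or j follows i
    return co_following, co_followed, following

def co_post(P):
    """Boolean post-post matrix: True where two distinct posts share an author."""
    P = (np.asarray(P) > 0).astype(int)
    return ((P.T @ P) > 0) & ~np.eye(P.shape[1], dtype=bool)

def lift_to_posts(R, P):
    """Turn a user-user relation R into a post-post relation via authorship."""
    P = (np.asarray(P) > 0).astype(int)
    return (P.T @ R.astype(int) @ P) > 0

# Toy network: 4 users, 8 posts (authorship and edges are assumptions).
authors = np.array([0, 0, 1, 1, 1, 2, 2, 3])
P = np.zeros((4, 8))
P[authors, np.arange(8)] = 1
S = np.zeros((4, 4))
S[0, 3] = S[2, 3] = 1   # u1 and u3 follow u4
S[3, 0] = S[3, 1] = 1   # u4 follows u1 and u2

co_following, co_followed, following = user_relations(S)
post_co_post = co_post(P)
post_co_following = lift_to_posts(co_following, P)
```

In this toy network, 𝑢1 and 𝑢3 are co-following (both follow 𝑢4), and 𝑢1 and 𝑢2 are co-followed (both are followed by 𝑢4). Lifting a user-level relation through P links every post of one user to every post of the other, which is the form needed to reason about post-post relations.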

Post-Post relations

• What do these relations suggest for posts?

Social Correlation Theories

• Homophily

- People with similar interests are more likely to be

linked

• Social influence

- People that are linked are more likely to have

similar interests

CoPost Hypothesis

• CoPost Hypothesis

- Posts by the same user are more likely to be of similar topics

[Figure: user 𝑢2 with posts 𝑝3, 𝑝4, and 𝑝5]

CoFollowing Hypothesis

• CoFollowing Hypothesis

- If two users follow the same user, their posts are likely of similar topics

[Figure: users 𝑢1, 𝑢3, and 𝑢4 with posts 𝑝1, 𝑝2, 𝑝6, 𝑝7, and 𝑝8]

CoFollowed Hypothesis

• CoFollowed Hypothesis

- If two users are followed by the same user, their posts are likely of similar topics

[Figure: users 𝑢1, 𝑢2, and 𝑢4 with posts 𝑝1 to 𝑝5 and 𝑝8]

Following Hypothesis

• Following Hypothesis

- If one user follows another, their posts are more likely to be similar in terms of topics

[Figure: users 𝑢1 and 𝑢2 with posts 𝑝1 to 𝑝5]

Modeling CoFollowing Relation

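• Two co-following users have similar topic interests

• Users' topic interests

$$\hat{T}(u_k) = \frac{\sum_{f_i \in F_k} T(f_i)}{|F_k|} = \frac{\sum_{f_i \in F_k} W^T f_i}{|F_k|}$$

$$\min_W \; \|X^T W - Y\|_F^2 + \alpha \|W\|_{2,1} + \beta \sum_{u} \sum_{u_i, u_j \in N_u} \|\hat{T}(u_i) - \hat{T}(u_j)\|_2^2$$

where $F_k$ is the set of posts by user $u_k$, $N_u$ is the set of users who follow $u$, and $\alpha$, $\beta$ are trade-off parameters.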
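As an illustration of the co-following term, the sketch below computes each user's averaged topic representation and then sums the squared differences between every pair of users who follow the same user. It is a direct, loop-based reading of the formula and not the authors' implementation. The function names, the toy data, and the convention S[i, j] = 1 for "user i follows user j" are assumptions.

```python
import numpy as np

def user_topics(W, F, P):
    """Column k holds the mean of W^T f_i over the posts f_i written by user k."""
    counts = P.sum(axis=1)      # |F_k| per user; assumes every user has a post
    H = P.T / counts            # N x n_users, H[i, k] = 1/|F_k| if u_k wrote p_i
    return W.T @ F @ H          # k x n_users

def cofollowing_penalty(W, F, P, S):
    """Sum of squared topic differences over pairs of users with a common followee."""
    T_hat = user_topics(W, F, P)
    total = 0.0
    for u in range(S.shape[0]):
        followers = np.flatnonzero(S[:, u])   # users who follow u
        for i in followers:
            for j in followers:
                total += np.sum((T_hat[:, i] - T_hat[:, j]) ** 2)
    return total

# Toy data (sizes, values, authorship, and edges are made up).
rng = np.random.default_rng(0)
m, N, k, n_users = 5, 8, 2, 4
F = rng.standard_normal((m, N))          # one column per post
W = rng.standard_normal((m, k))          # feature weights
authors = np.array([0, 0, 1, 1, 1, 2, 2, 3])
P = np.zeros((n_users, N))
P[authors, np.arange(N)] = 1
S = np.zeros((n_users, n_users))
S[0, 3] = S[2, 3] = 1                    # u1 and u3 follow u4
S[3, 0] = S[3, 1] = 1                    # u4 follows u1 and u2

T_hat = user_topics(W, F, P)
penalty = cofollowing_penalty(W, F, P, S)
```

Only 𝑢1 and 𝑢3 share a followee in this toy network, so the penalty reduces to twice the squared distance between their averaged topic vectors (each unordered pair is visited in both orders).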

A Reformulation of CoFollowing Relation

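• It is equivalent to

$$\min_W \; \mathrm{Tr}(W^T B W - 2EW) + \alpha \|W\|_{2,1}$$

where

$$B = XX^T + \beta F H L_{FI} H^T F^T, \qquad E = Y^T X^T$$

$$H(i,j) = \frac{1}{|F_j|} \ \text{if } u_j \text{ is the author of } p_i \text{, and } 0 \text{ otherwise}$$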
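The equivalence can be checked numerically on a toy example. The sketch below assumes that the matrix attached to the co-following term is the graph Laplacian of a user-user co-following matrix A, where A[i, j] counts the users followed by both i and j. With that choice, the pairwise sum equals twice the trace form, so the factor of 2 is written out here instead of being absorbed into β. All data values and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, n, k, n_users = 5, 8, 4, 2, 4      # hypothetical toy sizes
beta = 0.3                               # arbitrary trade-off value

F = rng.standard_normal((m, N))          # all posts, one column each
X = F[:, :n]                             # labeled posts
Y = np.eye(k)[[0, 1, 0, 1]]              # label indicator matrix
W = rng.standard_normal((m, k))          # any weight matrix works for the check

authors = np.array([0, 0, 1, 1, 1, 2, 2, 3])
P = np.zeros((n_users, N))
P[authors, np.arange(N)] = 1
S = np.zeros((n_users, n_users))         # S[i, j] = 1: user i follows user j
S[0, 3] = S[2, 3] = 1
S[3, 0] = S[3, 1] = 1

H = P.T / P.sum(axis=1)                  # H[i, j] = 1/|F_j| if u_j wrote p_i
A = S @ S.T                              # co-following counts between users
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian of A
T_hat = W.T @ F @ H                      # averaged topic vector per user

# Pairwise form of the co-following term.
direct = sum(
    A[i, j] * np.sum((T_hat[:, i] - T_hat[:, j]) ** 2)
    for i in range(n_users)
    for j in range(n_users)
)
# Trace form of the same term.
trace_form = 2 * np.trace(W.T @ F @ H @ L @ H.T @ F.T @ W)

# Full smooth part of the objective, before and after the reformulation.
B = X @ X.T + 2 * beta * (F @ H @ L @ H.T @ F.T)
E = Y.T @ X.T
lhs = np.linalg.norm(X.T @ W - Y, "fro") ** 2 + beta * direct
rhs = np.trace(W.T @ B @ W - 2 * E @ W) + np.trace(Y.T @ Y)
```

The two sides differ from the trace objective only by the constant Tr(YᵀY), which does not depend on W and therefore does not change the minimizer.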

A Unique Problem for LinkedFS

• The LinkedFS framework is designed to solve the following optimization problem

$$\min_W \; \mathrm{Tr}(W^T B W - 2EW) + \alpha \|W\|_{2,1}$$
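One common way to minimize an objective of this form is iterative reweighting. Setting the gradient to zero gives W = (B + αD)⁻¹Eᵀ, where D is a diagonal matrix with entries 1 / (2‖wᵢ‖) computed from the rows wᵢ of the current W. The sketch below follows that scheme under the assumption that B is symmetric positive semidefinite. It is not taken from the slides, and the function names, the smoothing constant `eps`, and the toy data are hypothetical. For brevity, the toy B contains only the XXᵀ term.

```python
import numpy as np

def solve_l21(B, E, alpha, n_iter=50, eps=1e-8):
    """Iteratively reweighted solver for min_W Tr(W^T B W - 2 E W) + alpha * ||W||_{2,1}."""
    m = B.shape[0]
    D = np.eye(m)                                   # initial reweighting matrix
    for _ in range(n_iter):
        W = np.linalg.solve(B + alpha * D, E.T)     # closed-form update for fixed D
        row_norms = np.sqrt(np.sum(W ** 2, axis=1) + eps)
        D = np.diag(1.0 / (2.0 * row_norms))        # small rows get large penalties
    return W

def objective(W, B, E, alpha):
    """Value of Tr(W^T B W - 2 E W) + alpha * sum of row norms of W."""
    return np.trace(W.T @ B @ W - 2 * E @ W) + alpha * np.sum(np.linalg.norm(W, axis=1))

def top_k_features(W, K):
    """Indices of the K features whose rows of W have the largest norms."""
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(-scores)[:K]

# Toy problem: 10 features, 40 labeled posts, 2 classes (all made up).
rng = np.random.default_rng(0)
m, n, k = 10, 40, 2
X = rng.standard_normal((m, n))
labels = (X[0] + X[1] > 0).astype(int)    # labels depend on features 0 and 1 only
Y = np.eye(k)[labels]
B = X @ X.T                               # link-based term omitted in this toy
E = Y.T @ X.T

W_one = solve_l21(B, E, alpha=1.0, n_iter=1)
W = solve_l21(B, E, alpha=1.0, n_iter=50)
selected = top_k_features(W, K=3)
```

Ranking features by the row norms of W matches the role of the ℓ2,1 penalty, which shrinks whole rows toward zero so that uninformative features receive small scores.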

LinkedFS

Datasets

• BlogCatalog

- Undirected following

http://dmml.asu.edu/users/xufei/datasets.html

• Digg

- Directed following

http://www.public.asu.edu/~ylin56/kdd09sup.html

Data Characteristics

Experiment Setting

• Metric

- Classification accuracy

- Classifier: LibSVM

• Baseline methods

- t-test (TT)

- Information Gain (IG)

- Fisher Score (FS)

- Joint ℓ2,1-norms (RFS)

Training and Testing

• Testing (50%) and training (50%)

• Subsample 5%, 25%, and 50% of the training data to construct three additional training sets

• Numbers of selected features

- 50, 100, 200, 300
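The splitting protocol above could be set up along the following lines. This is a sketch under stated assumptions, not the authors' script: posts are split uniformly at random, the subsamples are drawn without replacement from the training half, and the function name and seed are hypothetical.

```python
import numpy as np

def make_splits(n_posts, fractions=(0.05, 0.25, 0.50), seed=0):
    """Return training index sets (full plus subsamples) and the test indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_posts)
    half = n_posts // 2
    test_idx, train_idx = perm[:half], perm[half:]   # 50% test, 50% train
    train_sets = {1.0: train_idx}                    # the full training set
    for frac in fractions:
        size = max(1, int(round(frac * len(train_idx))))
        train_sets[frac] = rng.choice(train_idx, size=size, replace=False)
    return train_sets, test_idx

train_sets, test_idx = make_splits(1000)
feature_counts = [50, 100, 200, 300]    # numbers of selected features to evaluate
```

Each combination of training set and feature count would then be scored by training a classifier on the selected features and measuring accuracy on the held-out test posts.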

Results on Digg

Results on Digg

Performance Improvement

Conclusions

• Investigate a new problem of feature selection for

social media data

• Provide a way to capture link information guided

by social correlation theories

• Propose an effective framework, LinkedFS, for

social media feature selection

Future Work

• Sophisticated ways to exploit social context

• Lack of label information (unsupervised)

• Noisy and incomplete social media data

• The strength of social ties (strong and weak ties mixed)

Acknowledgments

This work is sponsored, in part, by the National Science Foundation via a grant (#0812551). Comments and suggestions from DMML members and reviewers are greatly appreciated.

Questions
