Learning to Classify Users in Online Interaction Networks Georgios Rizos, Symeon Papadopoulos, and Yiannis Kompatsiaris Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI) ICCSS 2015, June 10, 2015, Helsinki, Finland


User Classification

#2

Twitter Handle | Labels
@nytimes | usa, press, new york
@HuffPostBiz | finance
@BBCBreaking | press, journalist, tv
@StKonrath | journalist

Examples from SNOW 2014 dataset

User Classification in (and outside) OSNs

#3

[Diagram labels: OSN; online activities; log files; APIs; Behaviour Observation; Profiling/Classification]

Network-based User Classification

• People with similar interests tend to connect (homophily)

• Knowing about one’s connections could reveal information about them

• Knowing about the whole network structure could reveal even more…

#4

Related Work: User Classification

Graph-based semi-supervised learning:
• Label propagation (Zhu and Ghahramani, 2002)
• Local and global consistency (Zhou et al., 2004)
• Empirical evaluation of many graph kernels (Fouss et al., 2012)

Other approaches to user classification:
• Hybrid feature engineering for inferring user behaviors (Pennacchiotti and Popescu, 2011; Wagner et al., 2013)
• Crowdsourcing Twitter list keywords for popular users (Ghosh et al., 2012)
• Content-based, graph-regularized NMF for spammer detection (Hu et al., 2013)
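To make the first family of methods concrete, here is a minimal sketch of label propagation on a small graph. It is a simplified random-walk variant with clamped labels, written in the spirit of Zhu and Ghahramani (2002); the normalization in the cited report differs in its details. The function name, the `-1` convention for unlabeled users and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def label_propagation(W, labels, n_classes, n_iter=100):
    """Propagate known labels over a weighted graph.

    W: symmetric (n, n) adjacency matrix; assumes no isolated vertices.
    labels: length-n integer array, class index for labeled users and
            -1 for unlabeled users (an illustrative convention).
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    # Row-normalized transition matrix of a random walk on the graph.
    T = W / W.sum(axis=1, keepdims=True)
    labeled = labels >= 0
    # One-hot encoding of the known labels.
    Y0 = np.zeros((len(labels), n_classes))
    Y0[labeled, labels[labeled]] = 1.0
    Y = Y0.copy()
    for _ in range(n_iter):
        Y = T @ Y                 # each user averages its neighbors' scores
        Y[labeled] = Y0[labeled]  # clamp the users whose labels are known
    return Y.argmax(axis=1)

# Path graph 0 - 1 - 2 - 3 with the two end users labeled.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
predicted = label_propagation(W, [0, -1, -1, 1], n_classes=2)
```

Each unlabeled user ends up with the label of the labeled user it is closer to, which is the homophily assumption in its simplest form.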

#5

Related Work: Graph Feature Extraction

First attempts at using community detection:
• EdgeCluster: Edge-centric k-means (Tang and Liu, 2009)
• MROC: Binary tree community hierarchy (Wang et al., 2013)

Low-rank matrix representation methods:
• Laplacian Eigenmaps: k eigenvectors of the graph Laplacian (Belkin and Niyogi, 2003; Tang and Liu, 2011)
• Random-Walk Modularity Maximization: Does not suffer from the resolution limit of ModMax (Devooght et al., 2014)
• Deepwalk: Deep representation learning (Perozzi et al., 2014)
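As an illustration of the low-rank family, the sketch below computes a Laplacian Eigenmaps embedding with NumPy. It assumes a small, connected, undirected graph given as a dense adjacency matrix; the function name and the toy graph are illustrative, and a real social graph would need a sparse eigensolver instead.

```python
import numpy as np

def laplacian_eigenmaps(W, k):
    """Embed vertices using k eigenvectors of the graph Laplacian.

    W: symmetric (n, n) adjacency matrix of a connected graph with
       no isolated vertices (assumption of this sketch).
    Returns an (n, k) feature matrix.
    """
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(L_sym)
    # Skip the trivial eigenvector (eigenvalue 0) and keep the next k.
    U = eigvecs[:, 1:k + 1]
    # Map back to solutions of the generalized problem L y = lambda D y.
    return d_inv_sqrt[:, None] * U

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
X = laplacian_eigenmaps(W, k=1)
```

With `k=1` the single feature takes one sign on the first triangle and the opposite sign on the second, so the embedding already separates the two groups.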

#6

Overview of Framework

#7

[Framework diagram. Stages, in order: online social interactions (retweets, mentions, etc.); social interaction user graph; ARCTE; unsupervised graph feature representation; feature weighting (using partial/sparse annotation); supervised graph feature representation; user label learning; classified users.]

Network Features using ARCTE

• Based on user-centric community detection.
• For each user, we extract two types of user-centric communities.
• Base user-centric community: […]
• Extended user-centric community: Consider a vector that contains similarity values between the seed user and all other users.
  – By truncating it appropriately, we can keep a community of the users most similar to the seed.
  – We keep the fewest possible users such that we still include the seed user’s direct neighbors.
• Denote the set of detected communities by […]. We form the feature matrix as follows: […]

#8
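The truncation rule for the extended community can be sketched as follows. This is a minimal illustration, assuming the similarity values are available as a dictionary and that every direct neighbor of the seed appears in it; the function and variable names are hypothetical. The slides do not say whether the seed itself counts as a member, so it is left out here.

```python
def extended_community(similarity, seed, neighbors):
    """Keep the shortest prefix of the similarity ranking that
    contains all direct neighbors of the seed user.

    similarity: dict mapping user -> similarity to the seed.
    neighbors: iterable with the seed's direct neighbors.
    """
    # Rank all other users from most to least similar to the seed.
    ranked = sorted((u for u in similarity if u != seed),
                    key=lambda u: similarity[u], reverse=True)
    remaining = set(neighbors)
    community = []
    for user in ranked:
        if not remaining:
            break  # every direct neighbor is already included
        community.append(user)
        remaining.discard(user)
    return community

similarity = {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1}
community = extended_community(similarity, seed="s", neighbors={"a", "c"})
```

In the toy input, user `b` is not a neighbor of the seed but is more similar to it than neighbor `c`, so it is included, while the less similar `d` is cut off.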

ARCTE: Toy Example

#9

Fast Approximate User-centric PageRank

• Given a seed user, we calculate the user-centric PageRank vector (i.e. the stationary distribution of a random walk whose restart distribution has probability 1 at the seed).

• The vector is localized and sparse; i.e. we neither propagate nor store trivial values.

• Instead of approximating the PageRank vector, we approximate cumulative PageRank differences. This gives a better approximation in fewer iterations.

• We alternate between two update rules:
  – Cumulative PR difference: […] (instead of PR: […] (Andersen et al., 2006))
  – Residual distribution: […]
  where […]: restart probability, […]: the […]-th row of […], and […]: the […]-th row of […]

• Finally, we divide each element of the resulting vector by the degree of the corresponding vertex in order to get approximate, user-centric, regularized commute times.
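As a point of reference for the update rules above, here is a minimal sketch of the baseline push scheme of Andersen et al. (2006), which alternates between updating the PageRank approximation and the residual distribution. It is not the cumulative-difference variant proposed in these slides. The names `approximate_ppr`, `degree_normalize`, `alpha` (restart probability) and `eps` (tolerance) are illustrative assumptions, and the graph is assumed to be undirected and stored as adjacency lists.

```python
from collections import deque

def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank (lazy-walk variant).

    adj: dict mapping each vertex to the list of its neighbors.
    Returns (p, r): the sparse approximation and the leftover residual.
    """
    p = {}
    r = {seed: 1.0}          # all probability mass starts at the seed
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        r_u = r.get(u, 0.0)
        if deg_u == 0:
            # Isolated vertex: nothing to propagate, keep the mass here.
            p[u] = p.get(u, 0.0) + r_u
            r[u] = 0.0
            continue
        if r_u < eps * deg_u:
            continue         # residual too small to be worth pushing
        # Move a fraction alpha of the residual into the approximation.
        p[u] = p.get(u, 0.0) + alpha * r_u
        # Lazy walk: half of the remainder stays, half is spread out.
        r[u] = (1.0 - alpha) * r_u / 2.0
        share = (1.0 - alpha) * r_u / (2.0 * deg_u)
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + share
            threshold = eps * len(adj[v])
            if before < threshold <= r[v]:
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p, r

def degree_normalize(p, adj):
    """Divide each score by the degree of its vertex."""
    return {v: score / len(adj[v]) for v, score in p.items() if adj[v]}

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
p, r = approximate_ppr(adj, seed=0)
scores = degree_normalize(p, adj)
```

Only vertices that receive a non-trivial amount of mass ever enter `p` or `r`, which is what keeps the vector localized and sparse on a large graph.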

#10

Community Weighting

• We perform a supervised community weighting step to boost the importance of highly predictive communities.

• For each community, we calculate a weight: […]

• The first factor is based on supervised chi-squared weighting, which quantifies the correlation for every feature-label pair.
  – PSNR aggregation across labels: […]

• The second factor is unsupervised inverse vertex frequency.
  – Consider idf with vertices as terms and communities as documents.

• We multiply each column of the feature matrix by the corresponding weight.
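The supervised factor can be illustrated with a short NumPy sketch. It assumes a binary user-by-community matrix `X` and a binary user-by-label matrix `Y`, both restricted to the annotated users, and computes the chi-squared statistic of every community-label pair from its 2x2 contingency table. Taking the maximum across labels is a stand-in chosen for this sketch, not the aggregation used in the slides, and the inverse vertex frequency factor is left out. All names are illustrative.

```python
import numpy as np

def chi2_community_weights(X, Y):
    """Chi-squared score per community, aggregated across labels.

    X: (n_users, n_communities) binary membership matrix.
    Y: (n_users, n_labels) binary label matrix for the same users.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    fx = X.sum(axis=0)[:, None]   # community sizes, shape (k, 1)
    ly = Y.sum(axis=0)[None, :]   # label counts, shape (1, L)
    a = X.T @ Y                   # in community and has label
    b = fx - a                    # in community, lacks label
    c = ly - a                    # outside community, has label
    d = n - fx - ly + a           # outside community, lacks label
    num = n * (a * d - b * c) ** 2
    den = fx * (n - fx) * ly * (n - ly)
    # Degenerate tables (constant feature or label) get a score of 0.
    chi2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    # Assumed aggregation across labels: keep the strongest association.
    return chi2.max(axis=1)

def weight_features(X, weights):
    """Multiply each column of X by the weight of its community."""
    return np.asarray(X, dtype=float) * weights[None, :]

X = [[1, 1], [1, 1], [0, 1], [0, 1]]   # community 1 contains everyone
Y = [[1], [1], [0], [0]]
weights = chi2_community_weights(X, Y)
X_weighted = weight_features(X, weights)
```

In the toy input, the first community matches the label exactly and receives a high weight, while the second contains every user, carries no information about the label, and is weighted down to zero.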

#11

Evaluation: Dataset Description

#12

Dataset | Labels | Vertices | Vertex Type | Edges | Edge Type
SNOW2014 Graph (Papadopoulos et al., 2014) | 90 | 533,874 | Twitter Account | 949,661 | Mentions + Retweets
IRMV-PoliticsUK (Greene & Cunningham, 2013) | 5 | 419 | Twitter Account | 11,349 | Mentions + Retweets
ASU-YouTube (Mislove et al., 2007) | 47 | 1,134,890 | YouTube Channel | 2,987,624 | Subscriptions
ASU-Flickr (Tang and Liu, 2009) | 195 | 80,513 | Flickr Account | 5,899,882 | Contacts

Ground truth generation:
• SNOW2014 Graph: Twitter list aggregation & post-processing
• IRMV-PoliticsUK: Manual annotation
• ASU-YouTube: User membership to group
• ASU-Flickr: User subscription to interest group

Evaluation: SNOW 2014 dataset

#13

SNOW2014 Graph (534K vertices, 950K edges): Twitter mentions + retweets; ground truth based on Twitter list processing

Evaluation: Insight Politics UK

#14

Insight-Multiview-PoliticsUK (419 vertices, 11K edges): mentions + retweets; ground truth based on manual annotation

Evaluation: ASU-YouTube

#15

ASU-YouTube (1.1M vertices, 3M edges): YouTube subscriptions; ground truth based on membership to groups

Evaluation: ASU-Flickr

#16

ASU-Flickr (80K vertices, 5.9M edges): Flickr contacts; ground truth based on membership to Flickr groups

Evaluation: Community Weighting

#17

Conclusion

• Key ideas:
  – new user feature representation based on user-centric communities
  – community weighting based on sparse annotations
  – consistently good performance both on interaction (mention/retweet) and affiliation (follow/subscribe) graphs

• Future Work:
  – integration of additional signals (content)
  – investigating feasibility on other classification problems, e.g. spammer detection

#18

Thank you!

• Resources:
  Slides: http://www.slideshare.net/sympapadopoulos/learning-to-classify-users-in-online-interaction-networks
  Code: https://github.com/MKLab-ITI/reveal-user-classification
        https://github.com/MKLab-ITI/reveal-user-annotation

• Get in touch:
  @sympapadopoulos / [email protected]
  @georgios_rizos / [email protected]

#19

References (1/3)

• Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6), 1373-1396.

• Tang, L., & Liu, H. (2011). Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3), 447-478.

• Devooght, R., Mantrach, A., Kivimäki, I., Bersini, H., Jaimes, A., & Saerens, M. (2014, April). Random walks based modularity: application to semi-supervised learning. In Proceedings of the 23rd international conference on World wide web (pp. 213-224). International World Wide Web Conferences Steering Committee.

• Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM.

• Tang, L., & Liu, H. (2009, November). Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM conference on Information and knowledge management (pp. 1107-1116). ACM.

• Wang, X., Tang, L., Liu, H., & Wang, L. (2013). Learning with multi-resolution overlapping communities. Knowledge and information systems, 36(2), 517-535.

#20

References (2/3)

• Zhu, X., & Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.

• Zhou, D., Bousquet, O., Lal, T. N., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. Advances in neural information processing systems, 16(16), 321-328.

• Fouss, F., Francoisse, K., Yen, L., Pirotte, A., & Saerens, M. (2012). An experimental investigation of kernels on graphs for collaborative recommendation and semisupervised classification. Neural Networks, 31, 53-72.

• Pennacchiotti, M., & Popescu, A. M. (2011, August). Democrats, republicans and starbucks afficionados: user classification in twitter. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 430-438). ACM.

• Ghosh, S., Sharma, N., Benevenuto, F., Ganguly, N., & Gummadi, K. (2012, August). Cognos: crowdsourcing search for topic experts in microblogs. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 575-590). ACM.

• Hu, X., Tang, J., Zhang, Y., & Liu, H. (2013, August). Social spammer detection in microblogging. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence (pp. 2633-2639). AAAI Press.

• Wagner, C., Asur, S., & Hailpern, J. (2013, September). Religious politicians and creative photographers: Automatic user categorization in twitter. In Social Computing (SocialCom), 2013 International Conference on (pp. 303-310). IEEE.

#21

References (3/3)

• Andersen, R., Chung, F., & Lang, K. (2006, October). Local graph partitioning using pagerank vectors. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on (pp. 475-486). IEEE.

• Papadopoulos, S., Corney, D., & Aiello, L. M. (2014). SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. In SNOW-DC@ WWW (pp. 1-8).

• Greene, D., & Cunningham, P. (2013, May). Producing a unified graph representation from multiple social network views. In Proceedings of the 5th Annual ACM Web Science Conference (pp. 118-121). ACM.

• Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., & Bhattacharjee, B. (2007, October). Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement (pp. 29-42). ACM.

• Tang, L., & Liu, H. (2009, June). Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 817-826). ACM.

#22

Auxiliary Slides

#23

Classifying Users using Network Structure

• We apply user-centric community detection to the problem of graph-based user classification. We name our approach ARCTE.

• Improved approximate, user-centric PageRank calculation for better local graph exploration.

• Supervised community weighting step that boosts the importance of highly predictive communities in the feature representation.

• Extensive comparative study of numerous state-of-the-art network feature extraction methods on several social interaction datasets.

#24