Matching Profiles of Facebook and VK Users
The First International Conference on Social Network Analysis, Higher School of Economics, Moscow, Russia
Alexander Panchenko 1,2, Dmitry Babaev 1, Sergey Objedkov 3
[email protected]
1 – Digital Society Laboratory, 2 – UCLouvain, 3 – HSE
November 21, 2014


DESCRIPTION

People often use several social networks at once, as they provide complementary features and user bases. In Russia, an average user is registered in 2.5 social networks, according to the director of the “Moi Mir” social network. Thus, information about one user is scattered across different networks. A profile matching algorithm takes as input a user profile from one social network and returns the profile of the same person in another social network, if any. Integrating information from various platforms can lead to a better representation of a user in applications such as internet marketing, search or recommendation systems. Several researchers have recently tried to tackle this problem (see Bartunov et al., Veldman, Malhotra et al., Sironi, Balduzzi et al., Jain et al.).

We present a new user identity resolution approach that uses minimal supervision and achieves a precision of 0.98 and a recall of 0.54. Unlike most existing approaches, the method is easily parallelizable and computationally efficient, and can be used to process an entire social network in a matter of hours. We used it to match Facebook, the most popular social network globally, with VKontakte, the most popular social network among Russian-speaking users. To the best of our knowledge, it is the largest matching experiment to date. While most prior experiments operated on datasets ranging from thousands to hundreds of thousands of profiles, we matched 90 million VKontakte profiles to 3 million Facebook profiles.

The algorithm consists of three phases:

1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.
2. Candidate ranking. The candidates are ranked according to the similarity of their friends.
3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Each VK profile is processed independently, and hence this operation can be easily parallelized (we rely on the MapReduce framework). In future work, we plan to use supervised learning to learn the thresholds used by the candidate ranking and best candidate selection stages.

References

M. Balduzzi, C. Platzer, T. Holz, E. Kirda, D. Balzarotti, and C. Kruegel. Abusing social networks for automated user profiling. In Recent Advances in Intrusion Detection, pages 422–441. Springer, 2010.
S. Bartunov, A. Korshunov, S.-T. Park, W. Ryu, and H. Lee. Joint link-attribute user identity resolution in online social networks. In Proc. of the Sixth SNA-KDD Workshop at KDD, 2012.
P. Jain, P. Kumaraguru, and A. Joshi. @I seek 'fb.me': identifying users across multiple online social networks. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 1259–1268. International World Wide Web Conferences Steering Committee, 2013.
A. Malhotra, L. Totti, W. Meira Jr, P. Kumaraguru, and V. Almeida. Studying user footprints in different o


Page 2

Outline

1 The Problem

2 The Data

3 The Method

Page 4

Users in Russia have 2.5 profiles on average

Facebook (FB) and VKontakte (VK)

Page 5

Problem

Motivation

- input: a user profile of one social network
- output: profile of the same person in another social network
- immediate applications in marketing, search, security, etc.

Contribution

- user identity resolution approach
- precision of 0.98 and recall of 0.54
- the method is computationally efficient and easily parallelizable

Page 6

Related Work

Several researchers recently tried to tackle this problem:
- Balduzzi et al. Abusing social networks for automated user profiling. Springer, 2010.
- Bartunov et al. Joint link-attribute user identity resolution in online social networks. SNA-KDD Workshop at KDD, 2012.
- P. Jain et al. @I seek 'fb.me': identifying users across multiple online social networks. WWW, 2013.
- Malhotra et al. Studying user footprints in different online social networks. IEEE Computer Society, 2012.
- Sironi. Automatic alignment of user identities in heterogeneous social networks. 2012.
- Veldman. Matching profiles from social network sites. 2009.

BUT: our experiment is the largest to date.

Page 8

Facebook (FB) and VKontakte (VK) types of data

Profiles: a set of user attributes
- categorical variables (region, city, profession, etc.)
- integer variables (age, graduation year, etc.)
- text variables (name, surname, etc.)

Network: a graph that relates users
- friendship graph
- followers graph
- commenting graph, etc.

Texts:
- posts
- comments
- group titles and descriptions

Multimedia content:
- Avatar
- Photos
- Videos
- Music

Page 9

Dataset

                                 VK            Facebook
Number of users in our dataset   89,561,085    2,903,144
Number of users in Russia [1]    100,000,000   13,000,000
User overlap                     29%           88%

training set: 92,488 matched FB-VK profiles

[1] According to comScore and http://vk.com/about

Page 10

How can training data be obtained?

... also valid for “cheap matching”!

- Link to FB in a VK profile
- Link to FB and VK in a third network, e.g. LJ or Foursquare
- Linking by email
- Linking by phone
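
The first source in the list above, a link to FB in a VK profile, can be sketched as follows. This is only an illustration and not the authors' code: the profile records, their `id` and `site` fields, and the regular expression are all assumptions made for the example, and real profile links come in more URL forms than this pattern covers.

```python
import re

# Hypothetical pattern for a Facebook profile link (assumption, not from the slides).
FB_LINK = re.compile(
    r"(?:https?://)?(?:www\.)?facebook\.com/([A-Za-z0-9.]+)", re.IGNORECASE
)


def extract_fb_id(vk_profile):
    """Return the Facebook identifier linked from a VK profile, or None."""
    # 'site' is an assumed free-text field of the profile record.
    match = FB_LINK.search(vk_profile.get("site") or "")
    return match.group(1) if match else None


def build_training_pairs(vk_profiles):
    """Collect (vk_id, fb_id) pairs for the profiles that link to Facebook."""
    pairs = []
    for profile in vk_profiles:
        fb_id = extract_fb_id(profile)
        if fb_id is not None:
            pairs.append((profile["id"], fb_id))
    return pairs
```

Pairs collected this way are matches found without running the matcher, which is why the slide notes that the same sources serve “cheap matching”.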

Page 11

Gathering of VK and FB data

- Big Data: VK data amounts to tens or even hundreds of TB
- Decide what you need (posts, profiles, etc.)
- Download:
  - API
  - Scraping
- Download limits and API limitations are specific to each network
- Parallelization is very practical, especially horizontal parallelization:
  - Amazon EC2, Distributed Message Queues

Page 12

Storing VK and FB data

- Again, Big Data
- NoSQL solutions are helpful
- Raw data: Amazon S3
- For analysis: HDFS
- Efficient retrieval: Elasticsearch

Page 14

Profile matching algorithm

1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.

2 Candidate ranking. The candidates are ranked according to the similarity of their friends.

3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.
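
The three phases can be wired together as in the sketch below. It is a minimal skeleton under stated assumptions: `generate`, `rank` and `select` are hypothetical callables standing in for the phases, and the thread pool is only a local stand-in for the distributed processing a full-network run needs. What it shows is that each VK profile is matched independently of the others, so the whole job is a plain map.

```python
from concurrent.futures import ThreadPoolExecutor


def match_profile(vk_profile, generate, rank, select):
    """Run the three phases for a single VK profile."""
    candidates = generate(vk_profile)      # phase 1: candidate generation
    scored = rank(vk_profile, candidates)  # phase 2: list of (fb_profile, score)
    return select(scored)                  # phase 3: best FB profile or None


def match_all(vk_profiles, generate, rank, select, workers=4):
    """Match every VK profile; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda profile: match_profile(profile, generate, rank, select),
            vk_profiles,
        )
        return list(results)
```

Because no state is shared between profiles, the same `match_profile` function could serve as the map step of a distributed job.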

Page 15

Candidate generation

- Retrieve FB users with names similar to an input VK profile.
- Two names are similar if:
  - the first letters are the same
  - the edit distance between the names is ≤ 2
- Levenshtein automata are used to compute the edit distance between names
- An automatically extracted dictionary of name synonyms is used:
  - “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
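
The name similarity test can be sketched as below. Two deliberate simplifications are assumed: the edit distance is computed with plain dynamic programming rather than Levenshtein automata (which mainly matter for fast lookup in a large index), and `NAME_SYNONYMS` is a tiny hand-written stand-in for the automatically extracted dictionary. The profile fields `first` and `last` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical synonym dictionary mapping diminutives to a canonical name.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name):
    """Lower-case a name and replace a known synonym by its canonical form."""
    name = name.strip().lower()
    return NAME_SYNONYMS.get(name, name)


def names_similar(a, b, max_distance=2):
    """Same first letter and edit distance of at most max_distance."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_distance


def generate_candidates(vk_profile, fb_profiles):
    """Keep FB profiles whose first and second names both look similar."""
    return [
        fb for fb in fb_profiles
        if names_similar(vk_profile["first"], fb["first"])
        and names_similar(vk_profile["last"], fb["last"])
    ]
```

A linear scan over all FB profiles, as in `generate_candidates`, is only workable for small data; at the scale described in the slides an index-based lookup would be required.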

Page 16

Candidate ranking

- The higher the number of friends with similar names in the VK and FB profiles, the greater the similarity of these profiles.
- Two friends are considered similar if:
  - the first two letters of their last names match;
  - the similarities $\mathrm{sim}_s$ between their first and last names are greater than the thresholds $\alpha$ and $\beta$, respectively:

$$\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}.$$

- The contribution of each friend to the similarity $\mathrm{sim}_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the expected name frequency:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j \,:\, \mathrm{sim}_s(s^f_i, s^f_j) > \alpha \,\wedge\, \mathrm{sim}_s(s^s_i, s^s_j) > \beta} \min\left(1, \frac{N}{|s^f_j| \cdot |s^s_j|}\right).$$

Here $s^f_i$ and $s^s_i$ are the first and second names of a VK profile, respectively, while $s^f_j$ and $s^s_j$ refer to an FB profile.
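
A possible reading of the two formulas is sketched below. The slides do not define $N$ or the meaning of $|s^f_j|$ and $|s^s_j|$ inside the sum, so this sketch assumes that $N$ is the total number of profiles and that $|s|$ there is the number of profiles bearing that name (in $\mathrm{sim}_s$ it is the string length). It also assumes that the sum runs over all pairs of a VK friend and an FB friend, and that unseen names have frequency 1. The `first_freq` and `last_freq` dictionaries are hypothetical inputs.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def sim_s(a, b):
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def sim_p(vk_friends, fb_friends, first_freq, last_freq, n_profiles,
          alpha=0.8, beta=0.6):
    """Profile similarity from friend lists of (first, last) name pairs."""
    score = 0.0
    for vk_first, vk_last in vk_friends:
        for fb_first, fb_last in fb_friends:
            if vk_last[:2] != fb_last[:2]:
                continue  # first two letters of the last names must match
            if sim_s(vk_first, fb_first) > alpha and sim_s(vk_last, fb_last) > beta:
                # Assumed: |s| in the sum is the name's frequency; unseen names count as 1.
                expected = first_freq.get(fb_first, 1) * last_freq.get(fb_last, 1)
                score += min(1.0, n_profiles / expected)
    return score
```

Under this reading, a shared friend with a rare name contributes the full weight of 1, while a friend with a very common name combination contributes less, so common names alone cannot produce a high score.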

Page 17

Best candidate selection

FB candidates are ranked according to their similarity $\mathrm{sim}_p$ to an input profile $p_{vk}$.

The best candidate $p_{fb}$ should pass two thresholds to match:

- its score should be higher than the score threshold $\gamma$:

$$\mathrm{sim}_p(p_{vk}, p_{fb}) > \gamma;$$

- it is either the only candidate, or the ratio between its score and the score of the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

$$\frac{\mathrm{sim}_p(p_{vk}, p_{fb})}{\mathrm{sim}_p(p_{vk}, p'_{fb})} > \delta.$$
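
The two tests can be written as a short function. This is a sketch under assumptions: candidates arrive as hypothetical `(fb_id, score)` pairs, the default thresholds are the values of $\gamma$ and $\delta$ reported in the results, and a runner-up with a score of zero is treated as passing the ratio test, a case the slides do not address.

```python
def select_best(scored, gamma=3.0, delta=5.0):
    """Return the id of the best candidate, or None if no confident match exists.

    scored: list of (fb_id, score) pairs in any order.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate needs no ratio test
    runner_up = ranked[1][1]
    # Assumption: a zero runner-up score means an unbounded ratio, which passes.
    if runner_up == 0 or best_score / runner_up > delta:
        return best_id
    return None  # too close to the next best candidate
```

The ratio test is what trades recall for precision: a profile with two similarly plausible candidates is left unmatched rather than guessed.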

Page 18

Results

Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.

Page 19

Results: matching VK and FB profiles

First name threshold, α        0.8
Second name threshold, β       0.6
Profile score threshold, γ     3
Profile ratio threshold, δ     5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54
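
The slides do not say how the expected precision and recall were measured. One plausible procedure, assumed here, is to evaluate against a set of known matched pairs: precision is computed over the predictions made for VK profiles whose true match is known, and recall over all known pairs. The dictionaries mapping a VK id to an FB id are hypothetical. The last lines check that the reported 22% agrees with the number of matched profiles divided by the size of the Facebook dataset.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted matches against known pairs.

    predicted, gold: dicts mapping a VK id to an FB id.
    """
    # Only predictions for VK ids with a known answer can be judged.
    judged = {vk: fb for vk, fb in predicted.items() if vk in gold}
    correct = sum(1 for vk, fb in judged.items() if gold[vk] == fb)
    precision = correct / len(judged) if judged else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Share of matched profiles relative to the Facebook dataset size.
matched_profiles = 644_334
fb_dataset_size = 2_903_144
matched_share = matched_profiles / fb_dataset_size
print(f"{matched_share:.0%}")  # prints 22%
```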

Page 20

Thank you! Questions?
