DESCRIPTION
People often use several social networks at once, as they provide complementary features and user bases. In Russia, an average user is registered in 2.5 social networks, according to the director of the “Moi Mir” social network. Thus, information about one user is scattered across different networks. A profile matching algorithm takes as input a user profile from one social network and returns the profile of the same person in another social network, if any. Integrating information from various platforms can lead to a better representation of a user in applications such as internet marketing, search, or recommendation systems. Several researchers have recently tried to tackle this problem (see Bartunov et al., Veldman, Malhotra et al., Sironi, Balduzzi et al., Jain et al.).

We present a new user identity resolution approach that uses minimal supervision and achieves a precision of 0.98 and a recall of 0.54. Unlike most existing approaches, the method is easily parallelizable and computationally efficient, and can be used to process an entire social network in a matter of hours. We used it to match Facebook, the most popular social network globally, with VKontakte, the most popular social network among Russian-speaking users. To the best of our knowledge, it is the largest-scale matching experiment to date. While most prior experiments operated on datasets ranging from thousands to hundreds of thousands of profiles, we matched 90 million VKontakte profiles to 3 million Facebook profiles.

The algorithm consists of three phases:
1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.
2. Candidate ranking. The candidates are ranked according to the similarity of their friends.
3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.

Each profile from the VK network is processed independently, and hence this operation can be easily parallelized (we rely on the MapReduce framework). In future work, we plan to use supervised learning to learn the thresholds used by the candidate ranking and best candidate selection stages.

References
M. Balduzzi, C. Platzer, T. Holz, E. Kirda, D. Balzarotti, and C. Kruegel. Abusing social networks for automated user profiling. In Recent Advances in Intrusion Detection, pages 422–441. Springer, 2010.
S. Bartunov, A. Korshunov, S.-T. Park, W. Ryu, and H. Lee. Joint link-attribute user identity resolution in online social networks. In Proc. of the Sixth SNA-KDD Workshop at KDD, 2012.
P. Jain, P. Kumaraguru, and A. Joshi. @i seek 'fb.me': identifying users across multiple online social networks. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 1259–1268. International World Wide Web Conferences Steering Committee, 2013.
A. Malhotra, L. Totti, W. Meira Jr, P. Kumaraguru, and V. Almeida. Studying user footprints in different o
The Problem The Data The Method
Matching Profiles of Facebook and VK Users
The First International Conference on Social Network Analysis,
Higher School of Economics, Moscow, Russia
Alexander Panchenko (1, 2), Dmitry Babaev (1), Sergey Objedkov (3)
(1) Digital Society Laboratory, (2) UCLouvain, (3) HSE
November 21, 2014
Alexander Panchenko Matching Profiles of Facebook and VK Users
Outline
1 The Problem
2 The Data
3 The Method
Users in Russia have 2.5 profiles on average
Facebook (FB) and VKontakte (VK)
Problem
Motivation
input: a user profile of one social network
output: profile of the same person in another social network
immediate applications in marketing, search, security, etc.
Contribution
user identity resolution approach
precision of 0.98 and recall of 0.54
the method is computationally efficient and easily parallelizable
Related Work
Several researchers recently tried to tackle this problem:
  Balduzzi et al. Abusing social networks for automated user profiling. Springer, 2010.
  Bartunov et al. Joint link-attribute user identity resolution in online social networks. SNA-KDD Workshop at KDD, 2012.
  P. Jain et al. @i seek 'fb.me': identifying users across multiple online social networks. WWW, 2013.
  Malhotra et al. Studying user footprints in different online social networks. IEEE Computer Society, 2012.
  Sironi. Automatic alignment of user identities in heterogeneous social networks. 2012.
  Veldman. Matching profiles from social network sites. 2009.
BUT: our experiment is the largest-scale to date.
Facebook (FB) and VKontakte (VK) types of data
Profiles: a set of user attributes
  categorical variables (region, city, profession, etc.)
  integer variables (age, graduation year, etc.)
  text variables (name, surname, etc.)
Network: a graph that relates users
  friendship graph
  followers graph
  commenting graph, etc.
Texts:
  posts
  comments
  group titles and descriptions
Multimedia content:
  Avatar
  Photos
  Videos
  Music
Dataset
                                  VK            Facebook
Number of users in our dataset    89,561,085    2,903,144
Number of users in Russia (1)     100,000,000   13,000,000
User overlap                      29%           88%
training set: 92,488 matched FB-VK profiles
(1) According to comScore and http://vk.com/about
How can training data be obtained?
. . . also valid for the “cheap matching”!
Link to FB in VK profile
Link to FB and VK in a third network, e.g. LJ or Foursquare
Linking by email
Linking by phone
Gathering of VK and FB data
Big Data: VK is worth tens or even hundreds of TB
Decide what you need (posts, profiles, etc.).
Download:
  API
  Scraping
Download limits and API limitations are specific to each network.
Parallelization is very practical, especially horizontal parallelization:
  Amazon EC2, Distributed Message Queues
Storing VK and FB data
Again, Big Data
NoSQL solutions are helpful
Raw data: Amazon S3
For analysis: HDFS
Efficient retrieval: Elasticsearch
Profile matching algorithm
1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.
2 Candidate ranking. The candidates are ranked according to the similarity of their friends.
3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.
Candidate generation
Retrieve FB users with names similar to an input VK profile.
Two names are similar if:
  the first letters are the same
  the edit distance between names ≤ 2
Levenshtein Automata for edit distance of names
Use an automatically extracted dictionary of name synonyms:
  “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
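
The name-similarity test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain dynamic-programming edit distance rather than the Levenshtein automata mentioned on the slide, the `NAME_SYNONYMS` table is a tiny hand-made stand-in for the automatically extracted dictionary, and mapping each synonym to one canonical form is a guess at how such a dictionary might be applied. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


# Hypothetical stand-in for the automatically extracted synonym dictionary.
NAME_SYNONYMS = {"sasha": "alexander", "sanya": "alexander", "sanek": "alexander"}


def canonical(name: str) -> str:
    """Lower-case a name and map known synonyms to one canonical form (assumption)."""
    key = name.strip().lower()
    return NAME_SYNONYMS.get(key, key)


def names_similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Same first letter and edit distance <= max_dist, after synonym mapping."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist
```

For instance, `names_similar("Sasha", "Alexander")` holds through the synonym table, while `names_similar("Anna", "Alexander")` fails on edit distance despite the shared first letter.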
Candidate ranking
The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
Two friends are considered to be similar if:
  the first two letters of their last names match
  the similarities $\mathrm{sim}_s$ between their first/last names are greater than thresholds $\alpha$, $\beta$:

\[
\mathrm{sim}_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}
\]

The contribution of each friend to the similarity $\mathrm{sim}_p$ of two profiles $p_{vk}$ and $p_{fb}$ is the inverse of the name expectation frequency:

\[
\mathrm{sim}_p(p_{vk}, p_{fb}) = \sum_{j \,:\, \mathrm{sim}_s(s_i^f, s_j^f) > \alpha \,\wedge\, \mathrm{sim}_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)
\]

Here $s_i^f$ and $s_i^s$ are the first and second names of a VK profile, respectively, while $s_j^f$ and $s_j^s$ refer to a FB profile.
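
A minimal sketch of the two similarity functions follows. Several points are assumptions rather than statements from the slides: $|s_j^f|$ and $|s_j^s|$ are read as the number of profiles bearing that first or second name (supplied here through a hypothetical `name_count` dictionary), $N$ is treated as a caller-supplied constant `n`, each FB friend is counted at most once, and names are assumed to be lower-cased already.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a: str, b: str) -> float:
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends, fb_friends, name_count, n, alpha=0.8, beta=0.6):
    """Profile similarity as a sum over FB friends that match some VK friend.

    vk_friends, fb_friends: lists of (first_name, last_name) tuples.
    name_count: hypothetical dict, name -> number of profiles with that name.
    n: the constant N from the formula (its exact meaning is an assumption).
    """
    score = 0.0
    for fb_first, fb_last in fb_friends:
        for vk_first, vk_last in vk_friends:
            if vk_last[:2] != fb_last[:2]:
                continue  # first two letters of the last names must match
            if sim_s(vk_first, fb_first) > alpha and sim_s(vk_last, fb_last) > beta:
                freq = name_count.get(fb_first, 1) * name_count.get(fb_last, 1)
                score += min(1.0, n / freq)  # rare names contribute more
                break  # count each FB friend at most once (assumption)
    return score
```

Under this reading, a shared friend with a very common name adds only a small fraction to the score, while a shared friend with a rare name adds up to 1.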
Best candidate selection
FB candidates are ranked according to their similarity $\mathrm{sim}_p$ to an input profile $p_{vk}$.
The best candidate $p_{fb}$ should pass two thresholds to match:
  its score should be higher than the score threshold $\gamma$:

\[
\mathrm{sim}_p(p_{vk}, p_{fb}) > \gamma
\]

  it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold $\delta$:

\[
\frac{\mathrm{sim}_p(p_{vk}, p_{fb})}{\mathrm{sim}_p(p_{vk}, p'_{fb})} > \delta
\]
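
The two-threshold rule can be written as a short function. This is a sketch, not the authors' code: the name `select_best` and the input format (a list of `(fb_id, score)` pairs) are hypothetical, the default values of `gamma` and `delta` are taken from the results slide, and treating a runner-up score of zero as an unbounded ratio is an assumption.

```python
from typing import Hashable, List, Optional, Tuple


def select_best(scored: List[Tuple[Hashable, float]],
                gamma: float = 3.0,
                delta: float = 5.0) -> Optional[Hashable]:
    """Return the id of the best FB candidate, or None if no candidate qualifies.

    scored: (fb_id, sim_p score) pairs for one input VK profile.
    """
    if not scored:
        return None
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best_id, best_score = ranked[0]
    if best_score <= gamma:
        return None  # fails the score threshold
    if len(ranked) == 1:
        return best_id  # the only candidate: no ratio test needed
    runner_up = ranked[1][1]
    if runner_up <= 0:
        return best_id  # ratio is unbounded (assumption)
    return best_id if best_score / runner_up > delta else None
```

With the defaults, `select_best([("a", 10.0), ("b", 1.5)])` returns `"a"`, whereas `select_best([("a", 10.0), ("b", 4.0)])` returns `None` because the ratio 2.5 does not exceed δ = 5.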
Results
Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.
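
One point on such a plot could be computed as below, given a set of known matched pairs. The slides do not describe the evaluation procedure, so this is an assumed setup: `gold` maps VK ids to their true FB ids, `predicted` maps VK ids to the FB ids returned by the matcher, and predictions for VK profiles outside `gold` are ignored because their correctness is unknown.

```python
def precision_recall(predicted: dict, gold: dict) -> tuple:
    """Precision and recall of predicted VK -> FB matches against known pairs."""
    # Only VK profiles with a known answer can be judged.
    evaluated = {vk: fb for vk, fb in predicted.items() if vk in gold}
    correct = sum(1 for vk, fb in evaluated.items() if gold[vk] == fb)
    precision = correct / len(evaluated) if evaluated else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

Sweeping the thresholds α, β, γ, and δ and recording one such pair per setting would trace out a precision-recall curve.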
Results: matching VK and FB profiles
First name threshold, α        0.8
Second name threshold, β       0.6
Profile score threshold, γ     3
Profile ratio threshold, δ     5
Number of matched profiles     644,334 (22%)
Expected precision             0.98
Expected recall                0.54
Thank you! Questions?