Deep neural networks for matching online social networking profiles

Vicentiu-Marian Ciorbaru & Traian Rebedea

University Politehnica of Bucharest, Romania

ICCCI 2017

Nicosia, Sep 27th

Outline

› Introduction

› Related work

– Personal web pages deduplication

– Social networking profiles matching

› Dataset

› Proposed approach

– Unsupervised vs Supervised

– Extracted features

– Deep neural network for profile matching

› Results

› Conclusions

Introduction

› Online people search is a significant part of web search (Artiles et al., 2010)

– 11-17% of queries include a person name

– ~4% contain only a person name

› Name ambiguity makes people search a complex problem to solve efficiently

– Huge overlap in person names worldwide

– The most popular 90,000 full names (first and last name) worldwide are shared by 100M+ individuals

› An important aspect of people search is finding most/all online sources of information (e.g. web pages) related to the same person

– Recent shift from general web pages to specific ones, like social networking sites and other professional communities

Introduction

› Our problem: given a set of web pages extracted from online social networks, determine the profiles which relate to the same individual

– Profile matching (or deduplication)

– Generates a (more) complete online identity for an individual

– Only uses public online information; however, adding up all this information about a person can cause privacy concerns

Related work

› Two main directions

– Personal web pages deduplication

– Matching social networking profiles

› First problem is more generic and complex, as one also needs to extract personal information (e.g. name, occupation, etc.) from a wide range of different structured web pages

› Entity deduplication, in general, is a very complex field of study in Databases, Natural Language Processing (NLP), and Information Retrieval

Related work

› Web People Search (WePS) datasets and competitions (Artiles et al., 2009 & 2010)

› Given all web pages returned by a generic search engine for a popular name, group pages such that each group corresponds to one specific person

› Most solutions cluster the web pages using features extracted from them, such as Wikipedia concepts, Named Entity Recognition (NER), bag of words (BoW), and hyperlinks, combined with different similarity measures

› A pairwise approach for solving this problem was also proposed

– Compute the probability that two pages refer to the same person

– Cluster pages by joining pairs that have a high probability of representing the same person

› WePS proposed B-cubed precision and recall for assessing performance
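
As an illustration of the metric, here is a minimal sketch of B-cubed precision and recall in its standard per-item form. The function name `bcubed` and the toy items and labels are hypothetical and are not taken from the WePS tooling or from the authors' code.

```python
from collections import defaultdict


def bcubed(predicted, gold):
    """Return (precision, recall) of a predicted clustering against a gold one.

    Both arguments map every item to a cluster label.
    """
    pred_members = defaultdict(set)
    gold_members = defaultdict(set)
    for item in gold:
        pred_members[predicted[item]].add(item)
        gold_members[gold[item]].add(item)

    precision = recall = 0.0
    for item in gold:
        same_pred = pred_members[predicted[item]]
        same_gold = gold_members[gold[item]]
        correct = len(same_pred & same_gold)
        # Share of the item's predicted cluster that truly belongs with it.
        precision += correct / len(same_pred)
        # Share of the item's true cluster that was recovered with it.
        recall += correct / len(same_gold)
    return precision / len(gold), recall / len(gold)


# Toy example: pages a, b, c belong to one person and page d to another.
gold = {"a": 1, "b": 1, "c": 1, "d": 2}
predicted = {"a": "x", "b": "x", "c": "y", "d": "y"}
precision, recall = bcubed(predicted, gold)
```

Because the scores are averaged per item, merging two different people into one group lowers precision, while splitting one person across several groups lowers recall.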

Related work

› More recent research focused on linking social networking profiles belonging to the same individual

› Zhang et al. (2015) proposed a binary classifier using a probabilistic graphical model (factor graph)

› Features computed using BoW and TF-IDF for the text in each profile, but also its social status (position of node in network) and connections

› Our solution only uses textual features, since the dataset does not contain connections (e.g. friends or followers)

– These additional features, or others such as the avatar/profile image, could only improve the results

Dataset

› Snapshot of multiple social networking profiles collected from 15 different online social networks and community websites

– Academia, Code-Project, Facebook, Github, Google+, Instagram, Lanyrd, Linkedin, Mashable, Medium, Moz, Quora, Slideshare, Twitter, and Vimeo

› For each profile, we extracted some/all of the following information: username, name (full name or distinct first and last names), gender, bio (short description), interests, publications, jobs, etc.

› The average number of social profiles per individual is 2.04 and the maximum is 10

› Most profile pages feature a brief description (bio) of the owner

› Profiles do not contain connections, nor posts written by the owner

Dataset

› Ground truth obtained from the website about.me

– Complete online information for professionals

– Contains links to several social networking profiles of the same person, added manually by each user

› Dataset contains information from over 200,000 about.me accounts

› Total number of extracted social networking profiles: 500,000+

› The corpus was created by Wholi and is one of the largest corpora used for social profile matching

› While other datasets (Perito et al., 2011; Zhang et al., 2016) have a larger number of distinct profiles, the ground truth is one order of magnitude larger for our dataset

– 200,000+ compared to ~10,000 items

– This allows training more complex classifiers, including deep neural networks

Dataset

› Ground truth data has been manually entered by users

– It might be incorrect in some cases (entry errors, user misbehaviour)

– Resembles crowdsourced datasets, which have recently become very popular for training complex models

› Train and test sets respect the following rules:

1. Train and test sets should contain different online identities (i.e. different individuals)

2. The clusters in the training set should have no entries present in the test set, in order to avoid overfitted models

3. The test set has the same distribution of cluster sizes as the train set, to provide a relevant comparison for online identities of various sizes

› Positive items extracted from about.me accounts, negative ones added randomly between profiles with similar names, location, etc.
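
One way the three split rules could be satisfied is sketched below: whole identities are assigned to one side, and the split is done separately for each cluster size. This is only an assumed procedure, not the authors' actual one; `split_identities`, the 20% test fraction, and the toy identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def split_identities(identities, test_fraction=0.2, seed=0):
    """Split whole identities (clusters of profiles) into train and test.

    `identities` maps an identity id to the list of its profile ids.
    Identities are grouped by cluster size and each group is split on its
    own, so both sides end up with a similar cluster-size distribution.
    """
    rng = random.Random(seed)
    by_size = defaultdict(list)
    for identity, profiles in identities.items():
        by_size[len(profiles)].append(identity)

    train, test = {}, {}
    for size in sorted(by_size):
        group = sorted(by_size[size])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        for identity in group[:n_test]:
            test[identity] = identities[identity]
        for identity in group[n_test:]:
            train[identity] = identities[identity]
    return train, test


# Toy data: ten identities with 2 profiles each and five with 3 profiles each.
identities = {f"pair{i:02d}": [f"pair{i:02d}/p{j}" for j in range(2)] for i in range(10)}
identities.update({f"trio{i:02d}": [f"trio{i:02d}/p{j}" for j in range(3)] for i in range(5)})
train_ids, test_ids = split_identities(identities)
```

Since every profile travels with its identity, no cluster can straddle the two sets, which covers rules 1 and 2; the per-size split is what approximates rule 3.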

Proposed solution

› Main contribution is using a deep neural network (NN) for matching online social networking profiles

› The NN can make efficient use of both textual features and domain-specific ones

› Also performed a comparison with other solutions used in previous studies, employing both unsupervised and supervised methods

› For the unsupervised approach, we first generated the feature vector for each profile, then applied Hierarchical Agglomerative Clustering (HAC) using cosine distance

› For binary classification we have a twofold objective

1. Detect whether two profiles refer to the same person and should be matched (pairwise matching)

2. For the graph of connected profiles discovered in phase 1, compute its connected components
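
The second phase can be sketched with a small union-find structure: every pair accepted by the classifier becomes an edge, and each connected component is reported as one online identity. The function name and the profile identifiers below are hypothetical illustrations, not the authors' code.

```python
def connected_components(profiles, matched_pairs):
    """Group profiles into identities from pairwise matches (union-find)."""
    parent = {p: p for p in profiles}

    def find(p):
        # Walk up to the root, shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for p in profiles:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical profiles; the pairs stand in for phase 1 classifier decisions.
profiles = ["twitter/alice", "github/alice", "linkedin/alice",
            "twitter/bob", "vimeo/bob", "quora/carol"]
matched_pairs = [("twitter/alice", "github/alice"),
                 ("github/alice", "linkedin/alice"),
                 ("twitter/bob", "vimeo/bob")]
identities = connected_components(profiles, matched_pairs)
```

Note that a single wrong match merges two whole components, which is one reason a pairwise classifier used this way needs high precision.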

Extracted features

› Given a pair of profiles (a, b)

› Domain-specific features: distance-based measures computed on names (full, first, last) and usernames, matching gender, matching location, matching company/employer, etc.

› Text-based features

– Computed from all the other textual attributes in a profile (e.g. bio, publications, interests)

– Used precomputed Word2Vec word embeddings with 300 dimensions, averaged over all words in a profile

– Also computed cosine and Euclidean distance between the word embeddings of the candidate pair (a, b)

› Feature normalization

– Compute the z-scores for each feature

– Whitening using Principal Component Analysis (PCA) in a 25-dimensional vector space to remove noise
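
The text-based features and the z-score step can be sketched as follows. The example assumes a tiny 3-dimensional toy vocabulary in place of the 300-dimensional Word2Vec vectors, and it leaves out the PCA whitening step; all function names and data are hypothetical.

```python
import math


def average_embedding(words, embeddings, dim):
    """Average the vectors of the known words; all zeros if none is known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def z_scores(column):
    """Standardise one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std if std else 0.0 for x in column]


# Toy 3-dimensional "embeddings" (assumption: real vectors have 300 dims).
toy_embeddings = {
    "python": [1.0, 0.0, 0.0],
    "developer": [0.0, 1.0, 0.0],
    "chef": [0.0, 0.0, 1.0],
}
profile_a = average_embedding(["python", "developer"], toy_embeddings, dim=3)
profile_b = average_embedding(["developer", "chef", "unknownword"], toy_embeddings, dim=3)
cos_ab = cosine_distance(profile_a, profile_b)
euc_ab = euclidean_distance(profile_a, profile_b)
normalized = z_scores([1.0, 2.0, 3.0])
```

In a full pipeline the two distances would be appended to the domain-specific features of the pair (a, b), and the z-scores would be computed per feature over the training set.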

Deep neural network for profile matching

› Given the very large dataset and the recent advances of deep learning, we propose a deep NN model for profile matching

› Deep NNs should be able to model more complex non-linear combinations of the different features (domain specific, word embeddings)

› Proposed a model which uses 6 fully-connected (FC) layers with different activation functions

› The loss function uses cross-entropy, with an added weight for false positives, which contribute 10 times more to the loss

– Penalizes false connections between profiles and counteracts the imbalanced distribution
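
One plausible reading of this loss is a binary cross-entropy in which the term for negative pairs (the only term a false positive can inflate) is multiplied by 10. The sketch below follows that assumption; the exact formulation used by the authors is not given on the slide, and `weighted_cross_entropy` is a hypothetical name.

```python
import math


def weighted_cross_entropy(y_true, y_prob, fp_weight=10.0, eps=1e-12):
    """Mean binary cross-entropy with the negative-class term up-weighted.

    y_true holds 0/1 labels (1 = same person); y_prob holds the predicted
    probability of a match. A confident "match" on a true non-match is a
    false positive and costs fp_weight times more (assumed reading).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1:
            total += -math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(y_true)


# A false positive and a false negative that are equally confident.
false_positive_loss = weighted_cross_entropy([0], [0.9])
false_negative_loss = weighted_cross_entropy([1], [0.1])
```

With these inputs the false positive costs ten times as much as the equally confident false negative, which pushes the classifier toward fewer wrong links between profiles.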

Deep neural network for profile matching

› The first layer takes as input the features computed for the candidate profile pair and projects them into a larger feature space (612 → 1024)

› The next two layers iteratively reduce the dimensionality of the representation to a denser feature space

› The final layers employ ReLU activation for the neurons, as ReLU units are known to provide better results for binary classification (Nair & Hinton, 2010)

› Dropout is employed to avoid overfitting
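
The layer layout described above can be sketched as a plain forward pass. Only the 612 → 1024 expansion and the count of 6 FC layers come from the slides; the remaining layer widths, the tanh activations in the early layers, the sigmoid output, and the dropout rate are assumptions made for illustration.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def sigmoid(z):
    # Numerically stable for large positive and negative inputs.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Six FC layers. Only 612 -> 1024 is from the slides; the rest is assumed.
FULL_SIZES = [612, 1024, 512, 256, 128, 64, 1]
# Assumed activations: tanh early, ReLU in the final hidden layers.
ACTIVATIONS = [math.tanh, math.tanh, math.tanh, relu, relu, sigmoid]


def init_layers(sizes, seed=0):
    """Create (weights, biases) for each consecutive pair of layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, layers, activations, dropout=0.0, rng=None):
    """Return the match probability for one feature vector.

    Dropout is applied to hidden layers only, and only when an rng is
    passed (training mode); inference is deterministic.
    """
    last = len(layers) - 1
    for i, ((weights, biases), act) in enumerate(zip(layers, activations)):
        x = [act(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
        if rng is not None and dropout > 0.0 and i < last:
            keep = 1.0 - dropout
            x = [v / keep if rng.random() < keep else 0.0 for v in x]
    return x[0]


# Scaled-down copy of the layout so the demo runs instantly.
small_sizes = [6, 10, 5, 4, 3, 2, 1]
small_layers = init_layers(small_sizes, seed=1)
features = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4]
probability = forward(features, small_layers, ACTIVATIONS)
train_probability = forward(features, small_layers, ACTIVATIONS,
                            dropout=0.5, rng=random.Random(2))
```

Passing `FULL_SIZES` to `init_layers` gives the full-width layout; training such a network would in practice be done with a deep learning framework rather than this pure-Python pass.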

Results

› Experiments performed using an imbalanced test set with one positive profile pair for 100 negative ones

– Reflects a real-world scenario, where for each correct match between two profiles, one compares tens/hundreds of incorrect (but similar) candidates

› Table shows B-cubed precision and recall obtained on the test set

› Matching on same names or on similar names is used as a baseline for comparison

Results

› Unsupervised methods (HAC) obtain poorer results than baseline mainly because cosine is not a good measure for cluster/item similarity for the proposed feature vectors

› The random forest (RF) classifier performs well only when domain-specific features are added to the word embeddings

– The large training set limited the number of trees in the forest (to 12)

– RF usually performs poorly when using word embeddings for a pair of documents (as it cannot compute a more complex similarity function)

› Mini-batch training of NNs allows using larger datasets than for RF

› The deep NN model learns a more complex combination between word embeddings and domain specific features, grouping profiles with similar embeddings and similar names

› Deep NN is the only model which can achieve both high recall (R=0.85) and high precision (P=0.95)

Results – examples

› Ground truth

– ['twitter/etniqminerals', 'instagram/etniqminerals', 'googleplus/106318957043871071183', 'facebook/etniqminerals', 'facebook/rockcityelitesalsa', 'facebook/1renaissancewoman', 'facebook/naturalblackgirlguide', 'linkedin/leahpatterson']

› Computed

– ['facebook/1renaissancewoman', 'linkedin/leahpatterson', 'googleplus/106318957043871071183']

– ['twitter/etniqminerals', 'instagram/etniqminerals', 'facebook/etniqminerals']

– [ 'facebook/naturalblackgirlguide']

› “Leah Patterson” is an individual who has two different companies “Etniq Minerals” and “Natural Black Girl Guide”

› Ground truth

– 3 different individuals whose first name is “Tim”, all of whom work in IT

› Computed

– ['googleplus/113375270405699485276', 'linkedin/timsmith78', 'googleplus/117829094399867770981', 'twitter/bbyxinnocenz', 'facebook/tim.tio.5', 'vimeo/user616297', 'linkedin/timtio', 'twitter/wbcsaint', 'twitter/turnitontim']

Conclusions

› Proposed a large dataset for matching online social networking profiles

› This allowed us to train a deep neural network for profile matching using both domain-specific features and word embeddings generated from textual descriptions in social profiles

› Experiments showed that the NN surpassed both unsupervised and supervised models, achieving a high precision (P = 0.95) with a good recall rate (R = 0.85)

› As far as we know, this result outperforms existing approaches for profile matching, but further validation is needed (to adapt it for other datasets and/or use other methods on current dataset)

› Further advancements can be made by training more complex deep learning models, using recurrent or convolutional networks, and by adding features extracted from profile pictures

Conclusions

› A similar architecture has been proposed by Google (Covington et al., 2016) for recommending YouTube videos

› However, we only found this work recently

Thank you!

Questions

Feedback

Selected references

› Artiles, J., Borthwick, A., Gonzalo, J., Sekine, S., Amigo, E.: WePS-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks. In: CLEF (Notebook Papers/LABs/Workshops) (2010)

› Artiles, J., Gonzalo, J., Sekine, S.: WePS 2 evaluation campaign: Overview of the web people search clustering task. In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference. vol. 9. Citeseer (2009)

› Covington, P., Adams, J., Sargin, E.: Deep neural networks for YouTube recommendations. In: Proceedings of the 10th ACM Conference on Recommender Systems. pp. 191-198. ACM (2016)

› Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807-814 (2010)

› Perito, D., Castelluccia, C., Kaafar, M.A., Manils, P.: How unique and traceable are usernames? In: Proceedings of the 11th International Conference on Privacy Enhancing Technologies. pp. 1-17. PETS'11, Springer-Verlag, Berlin, Heidelberg (2011)

› Zhang, Y., Tang, J., Yang, Z., Pei, J., Yu, P.S.: COSNET: Connecting heterogeneous social networks with local and global consistency. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1485-1494. KDD '15, ACM, New York, NY, USA (2015)
