CIKM Cup 2016: Cross-Device Linking

CIKM CUP 2016: Track 1

Cross-Device Linking

Alexey Grigorev

Berlin Machine Learning

2016.12.05

About Me

● Software Developer
● BI Masters @ TU Berlin
● Data Scientist

CIKM Cup 2016: Cross-Device Linking

[Figure labels: user, advertisements, ad providers]

Goal: Restore the Graph

[Figure: device graph; the unknown links are marked with “?”]

● Training data: the links are known
● New, unseen devices: no links

Data

● 240k train “users” (devices), 100k test users
● 500k train device-device pairs, 215k test pairs
● Denormalized: 2.5 GB click logs + 1 GB URLs & titles
● 67m clicks in total, 197 clicks per user on average

How to Approach?

● Machine Learning?
● First, optimize Recall
    ○ IR, unsupervised
    ○ Select “candidate” device-device pairs
    ○ Build a design matrix
● Then, optimize Precision
    ○ ML, supervised
    ○ Push the true pairs up the list
● Select top K pairs s.t. F1 is max


Optimizing Recall

● Recall: the fraction of all true device pairs that we discover
● Information Retrieval problem!
● For each device, we need to find the most similar ones
● Device == Document with
    ○ Tokens from all visited URLs + tokens from all titles
    ○ All put together into one single document
● Then use standard IR methods like TF-IDF
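The "device == document" idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the competition: the tokens are made up, the IDF smoothing is one common variant, and the brute-force comparison of all devices only works on toy data (at 340k devices an inverted index is needed).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn {device_id: [token, ...]} into L2-normalized TF-IDF vectors."""
    n_docs = len(docs)
    # Document frequency: in how many device documents each token occurs.
    df = Counter(token for tokens in docs.values() for token in set(tokens))
    vectors = {}
    for device, tokens in docs.items():
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; other weighting schemes work too).
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[device] = {t: w / norm for t, w in vec.items()}
    return vectors


def top_candidates(vectors, device, k):
    """Return the k devices most similar to `device` by cosine similarity."""
    query = vectors[device]
    scores = []
    for other, vec in vectors.items():
        if other == device:
            continue
        # Vectors are normalized, so the dot product is the cosine similarity.
        sim = sum(w * vec.get(t, 0.0) for t, w in query.items())
        scores.append((sim, other))
    scores.sort(reverse=True)
    return [other for _, other in scores[:k]]


# Toy device documents (hypothetical tokens from URLs and titles).
docs = {
    "d1": ["news", "berlin", "football", "bundesliga"],
    "d2": ["football", "bundesliga", "tickets"],
    "d3": ["recipes", "pasta", "cooking"],
}
vectors = tfidf_vectors(docs)
candidates = top_candidates(vectors, "d1", k=2)
```

Here `d2` shares tokens with `d1` and is ranked first, while `d3` has no overlap and gets a similarity of zero.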

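Optimizing Recall

[Figure: IR step. For each device, an ES MLT query returns the top 70 candidates, ordered from most similar to least similar; candidates are labeled 1 or 0.]

● (240k + 100k) devices * 70 candidates = 24m pairs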
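For readers unfamiliar with it, "ES MLT query" refers to the Elasticsearch More Like This query. The sketch below only builds a request body as a Python dict; the index name, field names and the `max_query_terms` value are assumptions for illustration, not taken from the talk.

```python
import json


def mlt_query(index, device_id, fields, size=70):
    """Build a search body asking for documents similar to one indexed device."""
    return {
        "size": size,  # top 70 candidates per device, as on the slide
        "query": {
            "more_like_this": {
                "fields": fields,
                # Use an already indexed device document as the query.
                "like": [{"_index": index, "_id": device_id}],
                "min_term_freq": 1,
                "max_query_terms": 50,  # illustrative value, not from the talk
            }
        },
    }


# "devices", "device_42", "urls" and "titles" are hypothetical names.
body = mlt_query("devices", "device_42", ["urls", "titles"])
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the index's search endpoint once per device, which yields the 70 candidates per device counted above.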

Optimizing Precision

[Figure: candidate list with labels 1 and 0]

● Now we have high Recall, but low Precision
    ○ Recall: fraction of all positive pairs that we discover
    ○ Precision: fraction of positive pairs within our results
● Use supervised Machine Learning to improve it

Next steps:
● Create features for each device pair
● Train a ranking ML model
● Take the top most reliable predictions
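The two definitions can be made concrete with a small sketch. The pairs below are made up; the only assumption is that a device pair is unordered, so (a, b) and (b, a) count as the same pair.

```python
def precision_recall(predicted, actual):
    """Precision and recall of predicted device pairs against the true pairs."""
    # A pair is unordered, so (a, b) and (b, a) must compare equal.
    predicted = {frozenset(p) for p in predicted}
    actual = {frozenset(p) for p in actual}
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


true_pairs = [("a", "b"), ("c", "d")]
# A generous candidate list: it contains every true pair plus four wrong ones.
candidate_pairs = [("a", "b"), ("d", "c"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
precision, recall = precision_recall(candidate_pairs, true_pairs)
```

This mirrors the situation after candidate selection: recall is perfect (2 of 2 true pairs found) while precision is only 2 out of 6.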

Features: Profiling

● Create a profile for each user
● Profile from sessions (30-minute inactivity cut):
    ○ Session duration
    ○ Number of visits per session
    ○ Number of sessions with only one visit
    ○ Duration of breaks between and within sessions
    ○ Number of consecutive requests with a ≤ 1ms delay
    ○ Starts and ends of sessions
    ○ Number of unique domains per session
    ○ Similarity of domains/urls/titles within each session
    ○ For all features: min, mean, max and std
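A minimal sketch of the session cut and of a few of the listed profile features. It assumes timestamps in seconds and treats a gap of more than 30 minutes as the start of a new session; the feature names and the use of the population standard deviation are illustrative choices, not details from the talk.

```python
from statistics import mean, pstdev

SESSION_GAP = 30 * 60  # 30 minutes of inactivity, in seconds (assumed unit)


def split_sessions(timestamps, gap=SESSION_GAP):
    """Group click timestamps into sessions separated by long inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])  # inactivity exceeded the gap: new session
    return sessions


def summarize(values):
    """The four aggregates applied to every per-session feature."""
    return {"min": min(values), "mean": mean(values), "max": max(values), "std": pstdev(values)}


def profile(timestamps):
    sessions = split_sessions(timestamps)
    durations = [s[-1] - s[0] for s in sessions]
    visits = [len(s) for s in sessions]
    return {
        "n_sessions": len(sessions),
        "n_single_visit_sessions": sum(1 for s in sessions if len(s) == 1),
        "duration": summarize(durations),
        "visits": summarize(visits),
    }


clicks = [0, 60, 120, 4000, 4030, 10000]  # made-up click times
device_profile = profile(clicks)
```

The toy click log splits into three sessions, one of which has a single visit.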

Features: Device-Device Similarities

● |profile1.feature - profile2.feature|
● TF-IDF similarity of
    ○ Domains
    ○ Titles
    ○ URLs
● LSA similarity of the same
● 54 features in total
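The first kind of pair feature, the absolute difference of two profiles, can be sketched as follows. The flat profile layout and the feature names are hypothetical.

```python
def profile_difference(profile_a, profile_b):
    """Absolute difference for every numeric feature the two profiles share."""
    shared = sorted(profile_a.keys() & profile_b.keys())
    return {f"diff_{name}": abs(profile_a[name] - profile_b[name]) for name in shared}


# Made-up flat profiles; the feature names are illustrative only.
p1 = {"session_duration_mean": 310.0, "visits_per_session_mean": 4.5}
p2 = {"session_duration_mean": 250.0, "visits_per_session_mean": 6.0}
pair_features = profile_difference(p1, p2)
```

Because of the absolute value, the features are symmetric: swapping the two devices gives the same row in the design matrix.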

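Optimizing Precision: Ranking Model

[Figure: the ranking model is trained on candidate pairs labeled 1 or 0 and scores the test pairs (e.g. 0.90, 0.87, 0.7, 0.3, 0.2); the top K pairs are kept.]

● Train: 240k * 70 = 17m pairs
● Test: 100k * 70 = 7m pairs
● Top K: 100k-200k pairs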

Features: Importance

[Figure: XGB feature importance, measured as the number of times a feature is used in a split; annotated groups: pair features, profile difference features]

Cross-Validation

[Figure: FOLD 1 vs FOLD 2]

● Information leak!

Cross-Validation

● Split the graph into non-overlapping regions
● For each region separately
    ○ Build ES index (i.e. apply filter)
    ○ Build a model
● Evaluation (AUC + F1 score):
    ○ Apply the fold 1 model (F1) to the fold 2 data (F2)
    ○ And vice versa

[Figure: folds F1 and F2, each with its own scored pairs, e.g. 0.90, 0.87, 0.7]
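One way to get non-overlapping regions is to keep every connected component of the known-link graph inside a single fold, so that no true pair straddles two folds. The sketch below does this with a union-find; whether the original solution split the graph exactly this way is an assumption.

```python
from collections import defaultdict


def connected_components(pairs):
    """Union-find over device pairs; returns a list of sets of devices."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for device in parent:
        groups[find(device)].add(device)
    return list(groups.values())


def assign_folds(pairs, n_folds=2):
    """Give every device a fold so that no known link crosses folds."""
    fold_of = {}
    sizes = [0] * n_folds
    # Largest components first, each into the currently smallest fold.
    for component in sorted(connected_components(pairs), key=len, reverse=True):
        fold = sizes.index(min(sizes))
        sizes[fold] += len(component)
        for device in component:
            fold_of[device] = fold
    return fold_of


pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]  # made-up links
fold_of = assign_folds(pairs)
```

Candidate selection and model training are then run per fold, and the model from one fold is scored on the other.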

Evaluation

Public/private test split: 50/50

● During the competition: evaluation on the 1st half of the data
● After the competition: evaluation on the 2nd half
● P = 0.5 of the real P

[Figure: test set; normal F1 vs the “real” evaluation function]
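One possible reading of "P = 0.5 of real P": a submission covers the whole test set but is scored against only half of it, so only about half of the correct pairs can be credited. Under that assumption, which the slides do not spell out, the score seen on the leaderboard relates to the normal F1 roughly as in this sketch; the function names and the numbers are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def leaderboard_f1(real_precision, recall, visible_fraction=0.5):
    """F1 when only `visible_fraction` of the correct pairs can be credited."""
    # Assumption: recall within the visible half is unchanged, precision shrinks.
    return f1_score(visible_fraction * real_precision, recall)


real = f1_score(0.4, 0.4)            # normal F1
observed = leaderboard_f1(0.4, 0.4)  # F1 under the halved precision
```

If this reading is right, the K that maximizes the leaderboard score differs from the K that maximizes the normal F1, so local validation has to use the same adjusted function.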

Choosing K

● Order the pairs by the probability
● For each K, calculate P, R and F1
● Select the best K such that F1 is max
● 8th position
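The three steps translate directly into a short loop. This sketch assumes the labels are available, i.e. that K is chosen on held-out validation folds; the scores and labels are made up.

```python
def best_k(scored_pairs, n_true_pairs):
    """scored_pairs: list of (probability, is_true_pair). Returns (k, f1)."""
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    best = (0, 0.0)
    hits = 0
    for k, (_, is_true) in enumerate(ranked, start=1):
        hits += is_true
        precision = hits / k             # true pairs among the top k
        recall = hits / n_true_pairs     # true pairs found so far
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        if f1 > best[1]:
            best = (k, f1)
    return best


scored = [(0.90, 1), (0.87, 1), (0.70, 0), (0.30, 1), (0.20, 0)]
k, f1 = best_k(scored, n_true_pairs=3)
```

On this toy list the best cut is after the fourth pair: going further only adds a false positive, stopping earlier misses a true pair.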

Post-Competition

What did others do?

● Using several candidate selection methods
● Stacking with rank features (by D. Dremov)
● Markov Clustering (by I. Bendyna)

Rank Features

source: http://gh.mltrainings.ru/presentations/Dremov_CIKMCup2016_DCA.pdf slide 9

● Relative position of a node within a group
● Motivation: “local” within-group effect instead of global
● df_train.groupby('user_1')[feature].rank()
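To make concrete what the pandas expression above computes, here is a plain-Python sketch of a within-group rank: ascending, with tied values sharing the average rank, which is the pandas default. The rows and the `tfidf_sim` feature name are made up.

```python
from collections import defaultdict


def group_rank(rows, group_key, feature):
    """Rank `feature` within each group; ties share the average rank."""
    by_group = defaultdict(list)
    for index, row in enumerate(rows):
        by_group[row[group_key]].append(index)

    ranks = [0.0] * len(rows)
    for indices in by_group.values():
        ordered = sorted(indices, key=lambda i: rows[i][feature])
        start = 0
        while start < len(ordered):
            end = start
            # Extend the run while the next value is tied with the current one.
            while (end + 1 < len(ordered)
                   and rows[ordered[end + 1]][feature] == rows[ordered[start]][feature]):
                end += 1
            average = (start + end) / 2 + 1  # ranks are 1-based
            for i in ordered[start:end + 1]:
                ranks[i] = average
            start = end + 1
    return ranks


rows = [
    {"user_1": "a", "user_2": "x", "tfidf_sim": 0.9},
    {"user_1": "a", "user_2": "y", "tfidf_sim": 0.4},
    {"user_1": "a", "user_2": "z", "tfidf_sim": 0.4},
    {"user_1": "b", "user_2": "x", "tfidf_sim": 0.1},
]
ranks = group_rank(rows, "user_1", "tfidf_sim")
```

The rank says how a candidate compares with the other candidates of the same device, regardless of the absolute similarity values.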


Stacking (post competition)

[Figure: stacking diagram with the inputs “all features”, “best features” and “rank features” and the models XGBoost, ET and XGBoost]

● 8th → 5th position

Markov Clustering

source: http://gh.mltrainings.ru/presentations/Bendyna_CIKMCup2016_DCA.pdf

● take a connected component
● add loops
● put into a Markov Matrix M
    ○ also called “Stochastic Matrix”
    ○ values in cols sum up to 1
● calculate M ** n
    ○ ~ n Random Walk steps
● for each element M.v = M.v ** p
    ○ makes weak links weaker
● re-normalize and repeat
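The steps above can be sketched in plain Python on a small dense matrix. This is a toy illustration, not the implementation from the cited presentation: the expansion power, the inflation power, the fixed number of iterations and the threshold for reading off clusters are all assumed values.

```python
def normalize_columns(m):
    """Rescale every column so that its values sum up to 1."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)] for r in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def markov_clustering(adjacency, expansion=2, inflation=2.0, iterations=50):
    n = len(adjacency)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[float(adjacency[r][c]) + (1.0 if r == c else 0.0) for c in range(n)]
         for r in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: M ** n, i.e. n random walk steps.
        expanded = m
        for _ in range(expansion - 1):
            expanded = matmul(expanded, m)
        # Inflation: element-wise power p, then re-normalize the columns.
        m = normalize_columns([[v ** inflation for v in row] for row in expanded])
    # Rows that kept some mass act as attractors; their non-zero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(c for c, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge 2-3.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_clustering(graph)
print(clusters)
```

A higher inflation power tends to produce more, smaller clusters, so in a linking task it controls how aggressively weak candidate links inside a component are cut.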

Animation http://micans.org/mcl/ani/mcl-animation.html


Links & Further Info

● Competition website: http://cikmcup.org/
● Competition platform: https://competitions.codalab.org/competitions/11171
● My solution: https://github.com/alexeygrigorev/cikm-cup-2016-cross-device
● Reports: http://cikmcup.org/workshop.html

Self-promotion:

● http://alexeygrigorev.com/
● [email protected]


Thank you. Questions?
