
Page 1: Internet Mathematics 2011

Yandex Relevance Prediction Challenge
Overview of the “CLL” team’s solution

R. Gareev (1), D. Kalyanov (2), A. Shaykhutdinova (1), N. Zhiltsov (1)

(1) Kazan (Volga Region) Federal University
(2) 10tracks.ru

28 December 2011

1 / 52

Page 2: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

2 / 52

Page 3: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

3 / 52

Page 4: Internet Mathematics 2011

Problem statement

• Predict document relevance from user behavior, a.k.a. «Implicit Relevance Feedback»

• See also http://imat-relpred.yandex.ru/en for more details

4 / 52

Page 5: Internet Mathematics 2011

User session example

[Figure: a user session with its region. Query Q1 is issued at T = 0 and returns documents 1–5; the user clicks documents 3, 5 and 1 at T = 10, 35 and 100. Query Q2 is issued at T = 130 and returns documents 6–10; the user clicks documents 6 and 9 at T = 150 and 170.]

5 / 52

Page 6: Internet Mathematics 2011

Labeled data

Given judgements for some pairs of documents and queries:

• a document Dj is relevant for a query Qi from a region R, or

• a document Dj is not relevant for a query Qi from a region R

6 / 52

Page 7: Internet Mathematics 2011

The problem

• Given a set Q of search queries, for each (q, R) ∈ Q provide a sorted list of documents D1, . . . , Dm that are relevant to q in the region R

• Area Under the ROC Curve (AUC), averaged over all the test query-region pairs, is the target evaluation metric

7 / 52

Page 8: Internet Mathematics 2011

AUC score

• Consider a list of documents D1, . . . , Di, . . . , Dm, where D1, . . . , Di is the prefix of length i

• (FPR(i), TPR(i)) gives a single point on the ROC curve

• AUC is the area under the ROC curve

• AUC = probability that a randomly chosen relevant document comes before a randomly chosen non-relevant document

8 / 52
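A quick way to see the probabilistic interpretation above is to compare ROCR's AUC with the fraction of correctly ordered (relevant, non-relevant) pairs on a toy ranking; the scores and labels below are made up purely for illustration.

library(ROCR)

scores <- c(0.9, 0.8, 0.7, 0.4, 0.2)   # classifier certainty scores
labels <- c(1, 0, 1, 1, 0)             # 1 = relevant, 0 = not relevant

# Area under the ROC curve as computed by ROCR
auc <- performance(prediction(scores, labels), "auc")@y.values[[1]]

# The same quantity as the fraction of (relevant, non-relevant) pairs
# ranked in the correct order (no ties in this toy example)
pairs <- expand.grid(pos = scores[labels == 1], neg = scores[labels == 0])
mean(pairs$pos > pairs$neg)   # equals auc (= 4/6 here)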

Page 9: Internet Mathematics 2011

Our problem restatement

• We consider it as a machine learning task

• Using the relevance judgements, learn a classifier H(R, Q, D) that predicts whether document D is relevant to a query Q from a region R

• Replace RegionID, QueryID and DocumentID with related features extracted from the click log

• Use the classifier H(R, Q, D) to compute a list, sorted w.r.t. the classifier's certainty scores, for a query Q from a region R

9 / 52

Page 10: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

10 / 52

Page 11: Internet Mathematics 2011

Features

• A «feature» is a function of (Q, R, D)

• Each feature either is or is not associated with its related region

Types

• Document features
• Query features
• Time-concerned features

11 / 52

Page 12: Internet Mathematics 2011

Document features

1 (Q, D) → Number of occurrences of a URL in the SERP list
2 (Q, D) → Number of clicks
3 (Q, D) → Click-through rate
4 (Q, D) → Average position in the click sequence
5 (Q, D) → Average rank in the SERP list
6 (Q, D) → Average rank in the SERP list when the URL is clicked
7 (Q, D) → Probability of being last clicked
8 (Q, D) → Probability of being first clicked

(An R sketch of computing a few of these features follows this slide.)

12 / 52
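As a purely illustrative sketch (not the team's extraction code), a few of the (Q, D) features above can be computed from small show/click tables with base R; the data frames `shows` and `clicks` and their column names are assumptions made for this example.

# Toy show/click tables; in the contest these would come from the click log
shows  <- data.frame(QueryID = c(1, 1, 1, 2, 2),
                     URLID   = c(10, 11, 12, 10, 13))
clicks <- data.frame(QueryID = c(1, 1, 2),
                     URLID   = c(10, 12, 10))

# Feature 1: number of occurrences of a URL in the SERP list, per (Q, D)
shows$Shown  <- 1
shown   <- aggregate(Shown ~ QueryID + URLID, data = shows, FUN = sum)

# Feature 2: number of clicks per (Q, D)
clicks$Click <- 1
clicked <- aggregate(Click ~ QueryID + URLID, data = clicks, FUN = sum)

# Feature 3: click-through rate = clicks / occurrences
ctr <- merge(shown, clicked, all.x = TRUE)
ctr$Click[is.na(ctr$Click)] <- 0
ctr$CTR <- ctr$Click / ctr$Shown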

Page 13: Internet Mathematics 2011

User session example

[Figure: the user-session example from page 5, shown again to illustrate these features.]

13 / 52

Page 14: Internet Mathematics 2011

Query features

1 (Q) → Average number of clicks in a subsession
2 (Q) → Probability of being rewritten (i.e., of not being the last query in the session)
3 (Q) → Probability of being resolved (probability of its results being last clicked)

(An R sketch of the "probability of being rewritten" feature follows this slide.)

14 / 52
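As an illustrative sketch only, the "probability of being rewritten" feature can be derived from per-session query sequences; the data frame `sessions` and its columns are assumptions for this example, not the contest log schema.

# One row per query issued in a session, with its position in the session
sessions <- data.frame(SessionID  = c(1, 1, 2, 2, 2),
                       QueryID    = c(5, 7, 5, 9, 7),
                       QueryOrder = c(1, 2, 1, 2, 3))

# A query issue counts as "rewritten" if it is not the last query of its session
lastOrder <- aggregate(QueryOrder ~ SessionID, data = sessions, FUN = max)
names(lastOrder)[2] <- "LastOrder"
sessions <- merge(sessions, lastOrder)
sessions$Rewritten <- as.numeric(sessions$QueryOrder < sessions$LastOrder)

# Feature 2: per-query probability of being rewritten
aggregate(Rewritten ~ QueryID, data = sessions, FUN = mean)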

Page 15: Internet Mathematics 2011

User session example

[Figure: the user-session example from page 5, shown again to illustrate these features.]

15 / 52

Page 16: Internet Mathematics 2011

Time-concerned features

1 (Q) → Average time to first click
2 (Q, D) → Average time spent reading a document D

16 / 52

Page 17: Internet Mathematics 2011

User session example

[Figure: the user-session example from page 5, shown again to illustrate these features.]

17 / 52

Page 18: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

18 / 52

Page 19: Internet Mathematics 2011

Two phase extraction

1 Normalization
  • lookup filtering by the 'Important triples' set
  • normalization is specific to each feature

2 Grouping and aggregating

19 / 52

Page 20: Internet Mathematics 2011

Important triples

20 / 52

Page 21: Internet Mathematics 2011

Normalization

• Converting click-log entries to a relational table with the following attributes:
  • feature domain attributes, e.g.:
    • (Q, R, U), (Q, U) for document features
    • (Q, R), (Q) for query features
  • feature attribute value

• Sequential processing, session by session:
  • reject spam sessions
  • emit values (possibly repeated)

21 / 52

Page 22: Internet Mathematics 2011

Normalization example (I)

Click log (with SessionID and TimePassed omitted):

Action  QueryID  RegionID  URLs
Q       174      0         1625 1627 1623 2510 2524
Q       1974     0         2091 17562 1626 1623 1627
C       17562
C       1627
C       1625
C       2510

Intermediate table for the 'Average click position' feature:

QueryID  URLID  RegionID  ClickPosition
1974     17562  0         1
1974     1627   0         2
174      1625   0         1
174      2510   0         2

22 / 52
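As an illustrative aside (assuming clicks have already been attributed to the query whose result list contains them, which the normalization step above does), the ClickPosition column is just the within-query order of the clicks. A minimal base-R sketch:

# Clicks in the order they occur, already attributed to their query
clicks <- data.frame(QueryID  = c(1974, 1974, 174, 174),
                     URLID    = c(17562, 1627, 1625, 2510),
                     RegionID = 0)

# ClickPosition = position of the click within its query's click sequence
clicks$ClickPosition <- ave(seq_len(nrow(clicks)), clicks$QueryID, FUN = seq_along)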

Page 23: Internet Mathematics 2011

Normalization example (II)

Click log (with SessionID omitted):

Time  Action  QueryID  RegionID  URLs
0     Q       5        0         99 16 87 39
6     C                          84
120   Q       558      0         84 5043 5041 5039
125   Q       8768     0         74672 74661 74674 74671
145   C                          74661

Intermediate table for the 'Time to first click' feature:

QueryID  RegionID  FirstClickTime
5        0         6
8768     0         20

23 / 52

Page 24: Internet Mathematics 2011

Aggregation example (by triple)

24 / 52

Page 25: Internet Mathematics 2011

Aggregation example (by QU-pair)

25 / 52
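As an illustrative sketch of the aggregation phase, the intermediate 'Average click position' table from page 22 can be grouped and averaged either by the full (QueryID, URLID, RegionID) triple or by the (QueryID, URLID) pair:

clickPos <- data.frame(QueryID       = c(1974, 1974, 174, 174),
                       URLID         = c(17562, 1627, 1625, 2510),
                       RegionID      = 0,
                       ClickPosition = c(1, 2, 1, 2))

# Aggregation by triple (region-specific feature value)
aggregate(ClickPosition ~ QueryID + URLID + RegionID, data = clickPos, FUN = mean)

# Aggregation by QU-pair (feature value not associated with a region)
aggregate(ClickPosition ~ QueryID + URLID, data = clickPos, FUN = mean)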

Page 26: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

26 / 52

Page 27: Internet Mathematics 2011

Our final ML-based solution in a nutshell

• Binary classification task for predicting assessors' labels

• 26 features extracted from the click log

• Gradient Boosted Trees learning model (gbm R package)

• Tuning the model's parameters w.r.t. AUC averaged over the given query-region pairs

• Ranking URLs according to the best model's probability scores

(A sketch of this pipeline follows this slide.)

27 / 52
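A minimal sketch of this pipeline, assuming training and test data frames `train` and `test` whose columns follow the layout used in Appendix B (QueryID, RegionID, URLID, the 26 features and, for training, RelevanceLabel); the exact column names are assumptions for illustration.

library(gbm)

# Fit gradient boosted trees on the feature columns only
features <- setdiff(names(train), c("QueryID", "RegionID", "URLID"))
fit <- gbm(RelevanceLabel ~ ., data = train[, features],
           distribution = "bernoulli",
           n.trees = 500, interaction.depth = 2, shrinkage = 0.01)

# Score test URLs with predicted probabilities and rank them
# within each (QueryID, RegionID) pair
test$score <- predict(fit, newdata = test, n.trees = 500, type = "response")
ranked <- test[order(test$QueryID, test$RegionID, -test$score), ]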

Page 28: Internet Mathematics 2011

Training data

28 / 52

Page 29: Internet Mathematics 2011

Training data: Target values

29 / 52

Page 30: Internet Mathematics 2011

Training data: Feature values

30 / 52

Page 31: Internet Mathematics 2011

Training data: Missing values

31 / 52

Page 32: Internet Mathematics 2011

Data Analysis Scheme

1 Given the initial training and test sets

2 Partition the initial training set into two sets:
  • training set (3/4)
  • test set (1/4)

3 Consider the following models:
  • Gradient Boosted Trees (Bernoulli distribution; 0-1 loss function)
  • Gradient Boosted Trees (AdaBoost distribution; exponential loss function)
  • Logistic Regression

4 Learn and tune parameters w.r.t. the target metric (area under the ROC curve) on the training set using 3-fold cross-validation

5 Obtain estimates of the target metric on the test set

6 Choose the optimal model, refit it on the whole initial training set and apply it to the initial test set

32 / 52


Page 38: Internet Mathematics 2011

Boosting [Schapire, 1990]

• Given a training set (x1, y1), . . . , (xN, yN), with yi ∈ {−1, +1}

• For t = 1, . . . , T:
  • construct a distribution Dt on {1, . . . , N}
  • sample examples from it, concentrating on the "hardest" ones
  • learn a "weak classifier" (at least better than random) ht : X → {−1, +1} with error εt on Dt:

    εt = P_{i∼Dt}(ht(xi) ≠ yi)

• Output the final classifier H as a weighted majority vote of the ht

33 / 52

Page 39: Internet Mathematics 2011

AdaBoost [Freund & Schapire, 1997]

• Constructing Dt:
  • D1(i) = 1/N
  • given Dt and ht:

    Dt+1(i) = (Dt(i) / Zt) × e^(−αt)  if yi = ht(xi)
    Dt+1(i) = (Dt(i) / Zt) × e^(αt)   if yi ≠ ht(xi)

    where Zt is a normalization factor and

    αt = (1/2) ln((1 − εt) / εt) > 0

• Final classifier:

    H(x) = sign( Σt αt ht(x) )

34 / 52
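A minimal R sketch of the weight update above, purely for illustration (any weak learner could produce the predictions h; this is not the contest code, which used the gbm package):

# One AdaBoost round: given current weights D, labels y in {-1,+1}
# and weak-learner predictions h in {-1,+1}, return updated weights
# and the learner's weight alpha_t
adaboost_round <- function(D, y, h) {
  eps   <- sum(D[h != y])                        # weighted error eps_t
  alpha <- 0.5 * log((1 - eps) / eps)            # alpha_t
  Dnew  <- D * exp(ifelse(y == h, -alpha, alpha))
  list(D = Dnew / sum(Dnew), alpha = alpha)      # divide by Z_t (renormalize)
}

# Example: 5 points, uniform initial weights, a weak learner that
# misclassifies only the last point
adaboost_round(D = rep(1/5, 5), y = c(1, 1, -1, -1, 1), h = c(1, 1, -1, -1, -1))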

Page 40: Internet Mathematics 2011

Gradient boosted trees [Friedman, 2001]

• Stochastic gradient descent optimization of the loss function

• Decision trees as the weak classifier model

• Does not require feature normalization

• No need to handle missing values specially

• Reported good performance in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010] and [Gulin et al., 2011]

35 / 52

Page 41: Internet Mathematics 2011

Gradient boosted trees: gbm R package implementation

• Two distributions are available for classification tasks: Bernoulli and AdaBoost

• Three basic parameters: interaction depth (depth of each tree), number of trees (or iterations) and shrinkage (learning rate)

(A usage sketch of the tuning helpers from Appendix B follows this slide.)

36 / 52
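As a sketch of how the Appendix B helpers might be combined for one (interaction depth, shrinkage) setting; repeating this over a grid of the other two parameters is an assumption about the workflow, while TuningGbmFit and ComputeAUC are the functions listed in Appendix B:

# Cross-validated AUC as a function of n.trees for one parameter setting
aucByTrees <- TuningGbmFit(trainSet,
                           foldsNum = 3,
                           interactionDepth = 2,
                           minNumTrees = 100, maxNumTrees = 1500, step = 100,
                           shrinkage = 0.01,
                           distribution = "bernoulli",
                           aucfunction = ComputeAUC)

# Best number of trees and its average cross-validated AUC
aucByTrees[which.max(aucByTrees$AvgAUC), ]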

Page 42: Internet Mathematics 2011

Logistic regression: glm, stats R package

• Preprocess the initial training data: impute missing values with the help of bagged trees

• Fit the generalized linear model:

    f(x) = 1 / (1 + e^(−z)),  where z = β0 + β1x1 + · · · + βkxk

(A sketch of this step follows this slide.)

37 / 52
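A minimal sketch of this model, assuming a training frame `train` with a 0/1 RelevanceLabel column; caret's bagged-tree imputation (method = "bagImpute") stands in for the preprocessing step, and the column names are assumptions for illustration:

library(caret)

features <- setdiff(names(train), c("QueryID", "RegionID", "URLID", "RelevanceLabel"))

# Impute missing feature values with bagged trees
pp <- preProcess(train[, features], method = "bagImpute")
X  <- predict(pp, train[, features])

# Fit the logistic regression f(x) = 1 / (1 + exp(-z))
X$RelevanceLabel <- train$RelevanceLabel
logit <- glm(RelevanceLabel ~ ., data = X, family = binomial)

# Predicted probabilities used as certainty scores for ranking
scores <- predict(logit, newdata = X, type = "response")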

Page 43: Internet Mathematics 2011

Tuning gbm: bernoulli model

3-fold CV estimate of AUC for the optimal parameters: 0.6457435

38 / 52

Page 44: Internet Mathematics 2011

Tuning gbm: adaboost model

3-fold CV estimate of AUC for the optimal parameters: 0.6455384

39 / 52

Page 45: Internet Mathematics 2011

Comparative performance of three optimal models: test error estimates

Model                 Optimal parameter values                           Test estimate of AUC
gbm (bernoulli)       interaction.depth=2, n.trees=500, shrinkage=0.01   0.6324717
gbm (adaboost)        interaction.depth=4, n.trees=700, shrinkage=0.01   0.6313393
logistic regression   -                                                  0.618648

40 / 52

Page 46: Internet Mathematics 2011

Variable importance according to the best model

41 / 52

Page 47: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

42 / 52

Page 48: Internet Mathematics 2011

Contest statistics

• 101 participants, 84 of them eligible for the prize

• Two-stage evaluation procedure: a validation set and a test set (their sizes were unknown during the contest)

• Validation set size is ≈ 11 000 instances

• Test set size is ≈ 20 000 instances

43 / 52

Page 49: Internet Mathematics 2011

Preliminary Results: Validation set

19th place (AUC=0.650004)

44 / 52

Page 50: Internet Mathematics 2011

Final Results: Test set

34th place (AUC=0.643346)

#    Team          AUC
1    cointegral*   0.667362
2    Evlampiy*     0.66506
3    alsafr*       0.664527
4    alexeigor*    0.663169
5    keinorhasen   0.660982
6    mmp           0.659914
7    Cutter*       0.659452
8    S-n-D         0.658103
...  ...           ...
34   CLL           0.643346
...  ...           ...

45 / 52

Page 51: Internet Mathematics 2011

Acknowledgements

We would like to thank:

• the organizers from Yandex for an exciting challenge

• E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleagues from Kazan Federal University for fruitful discussions and support

46 / 52

Page 52: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

47 / 52

Page 53: Internet Mathematics 2011

References I

[Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting // Journal of Computer and System Sciences. – V. 55. – No. 1. – 1997. – P. 119–139.

[Friedman, 2001] Friedman, J. Greedy Function Approximation: A Gradient Boosting Machine // Annals of Statistics. – V. 29. – No. 5. – 2001. – P. 1189–1232.

[Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning The Transfer Learning Track of Yahoo!'s Learning To Rank Challenge with YetiRank // JMLR: Workshop and Conference Proceedings. – 2011. – P. 63–76.

[Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: User behavior as a predictor of a successful search // Proceedings of the Third ACM International Conference on Web Search and Data Mining. – ACM. – 2010. – P. 221–230.

48 / 52

Page 54: Internet Mathematics 2011

References II

[Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining User Web Search Activity with Layered Bayesian Networks or How to Capture a Click in its Context // Proceedings of the Second ACM International Conference on Web Search and Data Mining. – ACM. – 2009. – P. 162–171.

[Schapire, 1990] Schapire, R. The strength of weak learnability // Machine Learning. – V. 5. – No. 2. – 1990. – P. 197–227.

49 / 52

Page 55: Internet Mathematics 2011

Outline

1 Problem Statement

2 Features

3 Feature Extraction

4 Statistical analysis

5 Contest Results

6 Appendix A. References

7 Appendix B. R functions

50 / 52

Page 56: Internet Mathematics 2011

Compute AUC for gbm model

ComputeAUC <- function(fit, ntrees, testSet) {
  require(ROCR)
  require(foreach)
  require(gbm)

  # Feature columns only (drop identifiers and the label)
  pureTestSet <- subset(testSet, select = -c(QueryID, RegionID, URLID, RelevanceLabel))
  queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
  count <- nrow(queryRegions)

  # AUC for each (QueryID, RegionID) pair of the test set
  aucValues <- foreach(i = 1:count, .combine = "c") %do% {
    queryId <- queryRegions[i, "QueryID"]
    regionId <- queryRegions[i, "RegionID"]
    true.labels <- testSet[testSet$QueryID == queryId & testSet$RegionID == regionId, ]$RelevanceLabel
    m <- mean(true.labels)
    if (m == 0 | m == 1) {
      # AUC is undefined when only one class is present; skip this pair
      curAUC <- NA
    } else {
      gbm.predictions <- predict.gbm(fit,
        pureTestSet[testSet$QueryID == queryId & testSet$RegionID == regionId, ],
        n.trees = ntrees, type = "response")
      pred <- prediction(gbm.predictions, true.labels)
      perf <- performance(pred, "auc")
      curAUC <- perf@y.values[[1]][1]
    }
    curAUC
  }
  return(mean(aucValues, na.rm = TRUE))
}

51 / 52

Page 57: Internet Mathematics 2011

Tuning AUC for gbm model

TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                         minNumTrees = 100, maxNumTrees = 1500, step = 100,
                         shrinkage = .01, distribution = "bernoulli",
                         aucfunction = ComputeAUC) {
  require(gbm)
  require(foreach)
  require(caret)
  require(sqldf)

  FUN <- match.fun(aucfunction)
  ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)

  # createFolds returns training-set indices here (list = TRUE, returnTrain = TRUE)
  folds <- createFolds(trainSet$QueryID, foldsNum, T, T)
  aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
    inTrain <- folds[[i]]
    cvTrainData <- trainSet[inTrain, ]
    cvTestData <- trainSet[-inTrain, ]
    # Drop identifier columns before fitting
    pureCvTrainData <- subset(cvTrainData, select = -c(QueryID, RegionID, URLID))

    # Fit once with the maximal number of trees; smaller ensembles are
    # evaluated by truncating the same fit
    gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                  distribution = distribution, interaction.depth = interactionDepth,
                  n.trees = maxNumTrees, shrinkage = shrinkage)

    # Evaluate every candidate n.trees on the held-out fold
    foreach(n = ntreesSeq, .combine = "rbind") %do% {
      auc <- FUN(gbmFit, n, cvTestData)
      c(n, auc)
    }
  }
  aucvalues <- as.data.frame(aucvalues)
  # Average AUC over folds for each candidate number of trees
  avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC from aucvalues group by V1")
  return(avgAuc)
}

52 / 52