Yandex Relevance Prediction Challenge
Overview of the “CLL” team’s solution
R. Gareev1, D. Kalyanov2, A. Shaykhutdinova1, N. Zhiltsov1
1 Kazan (Volga Region) Federal University
2 10tracks.ru
28 December 2011
1 / 52
Outline
1 Problem Statement
2 Features
3 Feature Extraction
4 Statistical analysis
5 Contest Results
6 Appendix A. References
7 Appendix B. R functions
Problem statement
- Predict document relevance from user behavior, a.k.a. «Implicit Relevance Feedback»
- See also http://imat-relpred.yandex.ru/en for more details
User session example
[Figure: a sample session in region R. Query Q1 at T = 0 returns documents 1–5, with clicks on document 3 (T = 10), document 5 (T = 35) and document 1 (T = 100); query Q2 at T = 130 returns documents 6–10, with clicks on document 6 (T = 150) and document 9 (T = 170).]
Labeled data
Given judgements for some pairs of documents and queries:
- a document Dj is relevant for a query Qi from a region R, or
- a document Dj is not relevant for a query Qi from a region R
The problem
- Given a set Q of search queries, for each (q, R) ∈ Q provide a sorted list of documents D1, . . . , Dm that are relevant to q in the region R
- Area Under the ROC Curve (AUC), averaged over all the test query-region pairs, is the target evaluation metric
AUC score
- Consider a list of documents D1, . . . , Di, . . . , Dm, where D1, . . . , Di is the prefix of length i
- (FPR(i), TPR(i)) gives a single point on the ROC curve
- AUC is the area under the ROC curve
- AUC = probability that a randomly chosen relevant document comes before a randomly chosen non-relevant document
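The last bullet gives AUC its pairwise reading, which can be checked directly. A minimal sketch (Python for illustration; the team’s own code is the R in Appendix B):

```python
def auc_pairwise(scores, labels):
    """AUC as the probability that a randomly chosen relevant document
    is ranked above a randomly chosen non-relevant one (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_pairwise([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

This agrees with the area-under-curve definition; groups whose labels are all 0 or all 1 yield NaN, which matches the NA handling in the appendix code.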
Our problem restatement
- We consider it as a machine learning task
- Using relevance judgements, learn a classifier H(R, Q, D) that predicts whether a document D is relevant to a query Q from a region R
- Replace RegionID, QueryID and DocumentID with related features extracted from the click log
- Use the classifier H(R, Q, D) to compute a list, sorted w.r.t. the classifier’s certainty scores, for a query Q from a region R
Features
- A «feature» is a function of (Q, R, D).
- Each feature is either associated or not associated with its related region

Types
- Document features
- Query features
- Time-concerned features
Document features
1 (Q, D) → Number of occurrences of a URL in the SERP list
2 (Q, D) → Number of clicks
3 (Q, D) → Click-through rate
4 (Q, D) → Average position in the click sequence
5 (Q, D) → Average rank in the SERP list
6 (Q, D) → Average rank in the SERP list when the URL is clicked
7 (Q, D) → Probability of being last clicked
8 (Q, D) → Probability of being first clicked
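Most of these document features are simple ratios over grouped log entries. A toy sketch of the click-through-rate and average-click-position computations (Python for illustration; the log layout here is made up, not the contest’s actual format):

```python
from collections import defaultdict

# Toy click log: (query, url, shown, clicked, click_position)
log = [
    ("q1", "u1", True, True, 1),
    ("q1", "u1", True, False, None),
    ("q1", "u2", True, True, 2),
]

shows = defaultdict(int)        # impressions per (query, url)
clicks = defaultdict(int)       # clicks per (query, url)
positions = defaultdict(list)   # click-sequence positions per (query, url)
for q, u, shown, clicked, pos in log:
    key = (q, u)
    shows[key] += shown
    clicks[key] += clicked
    if clicked:
        positions[key].append(pos)

ctr = {k: clicks[k] / shows[k] for k in shows}
avg_click_pos = {k: sum(v) / len(v) for k, v in positions.items()}
print(ctr[("q1", "u1")])            # 0.5
print(avg_click_pos[("q1", "u2")])  # 2.0
```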
Query features
1 (Q) → Average number of clicks in a subsession
2 (Q) → Probability of being rewritten (not being the last query in a session)
3 (Q) → Probability of being resolved (probability of its results being last clicked)
User session example
RegionQ1 ⇒ 1 2 3 4 5 T = 0
3 T = 105 T = 351 T = 100
Q2 ⇒ 6 7 8 9 10 T = 1306 T = 1509 T = 170
15 / 52
Time-concerned features
1 (Q) → Average time to first click
2 (Q, D) → Average time spent reading a document D
Two-phase extraction
1 Normalization
• lookup filtering by the ’Important triples’ set
• normalization is specific to each feature
2 Grouping and aggregating
Normalization
- Convert click-log entries to a relational table with the following attributes:
• feature domain attributes, e.g.
  • (Q, R, U), (Q, U) for document features
  • (Q, R), (Q) for query features
• feature attribute value
- Sequential processing, session by session:
• reject spam sessions
• emit values (possibly repeated)
Normalization example (I)

Click log (with SessionID, TimePassed omitted):

Action QueryID RegionID URLs
Q 174 0 1625 1627 1623 2510 2524
Q 1974 0 2091 17562 1626 1623 1627
C 17562
C 1627
C 1625
C 2510

Intermediate table for the ’Average click position’ feature:

QueryID URLID RegionID ClickPosition
1974 17562 0 1
1974 1627 0 2
174 1625 0 1
174 2510 0 2
Normalization example (II)

Click log (SessionID omitted):

Time Action QueryID RegionID URLs
0 Q 5 0 99 16 87 39
6 C 84
120 Q 558 0 84 5043 5041 5039
125 Q 8768 0 74672 74661 74674 74671
145 C 74661

Intermediate table for the ’Time to first click’ feature:

QueryID RegionID FirstClickTime
5 0 6
8768 0 20
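The intermediate table above can be reproduced with a single pass over the session. A Python sketch under an assumed simplified record layout (time, action, QueryID-or-URLID, RegionID):

```python
# Toy version of the session from the slide; field layout is assumed,
# not the contest's actual log format.
log = [
    (0,   "Q", 5,     0),
    (6,   "C", 84,    None),
    (120, "Q", 558,   0),
    (125, "Q", 8768,  0),
    (145, "C", 74661, None),
]

rows = []        # (QueryID, RegionID, FirstClickTime)
current = None   # (query_time, query_id, region) of the latest query
clicked = False
for t, action, ident, region in log:
    if action == "Q":
        current, clicked = (t, ident, region), False
    elif action == "C" and current and not clicked:
        # Only the first click after each query contributes
        rows.append((current[1], current[2], t - current[0]))
        clicked = True

print(rows)  # [(5, 0, 6), (8768, 0, 20)]
```

Query 558 receives no click before the next query arrives, so it emits no row, matching the table.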
Our final ML-based solution in a nutshell
- Binary classification task for predicting assessors’ labels
- 26 features extracted from the click log
- Gradient Boosted Trees learning model (gbm R package)
- Tuning the model’s parameters w.r.t. AUC averaged over the given query-region pairs
- Ranking URLs according to the best model’s probability scores
Data Analysis Scheme
1 Given initial training and test sets
2 Partition the initial training set into two sets:
• training set (3/4)
• test set (1/4)
3 Consider the following models:
• Gradient Boosted Trees (Bernoulli distribution; 0–1 loss function)
• Gradient Boosted Trees (AdaBoost distribution; exponential loss function)
• Logistic Regression
4 Learn and tune parameters w.r.t. the target metric (Area under the ROC curve) on the training set using 3-fold cross-validation
5 Obtain the estimates for the target metric on the test set
6 Choose the optimal model, refit it on the whole initial training set and apply it to the initial test set
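Steps 1–2 of the scheme can be sketched as follows (illustrative Python; the fold assignment here is a simple round-robin, whereas the team used caret’s createFolds in R):

```python
import random

def split_and_folds(n_items, test_frac=0.25, k=3, seed=0):
    """Partition item indices into train (3/4) and held-out test (1/4),
    then split the train part into k CV folds."""
    rng = random.Random(seed)
    idx = list(range(n_items))
    rng.shuffle(idx)
    cut = int(n_items * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]
    folds = [train[i::k] for i in range(k)]  # round-robin fold assignment
    return train, test, folds

train, test, folds = split_and_folds(12)
print(len(train), len(test), [len(f) for f in folds])  # 9 3 [3, 3, 3]
```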
Boosting [Schapire, 1990]
- Given a training set (x1, y1), . . . , (xN, yN), yi ∈ {−1, +1}
- For t = 1, . . . , T:
• construct a distribution Dt on {1, . . . , N}
• sample examples from it, concentrating on the “hardest” ones
• learn a “weak classifier” (at least better than random) ht : X → {−1, +1} with error εt on Dt:

εt = Pi∼Dt(ht(xi) ≠ yi)

- Output the final classifier H as a weighted majority vote of the ht
AdaBoost [Freund & Schapire, 1997]
- Constructing Dt:
• D1(i) = 1/N
• given Dt and ht:

Dt+1(i) = (Dt(i)/Zt) × e^(−αt) if yi = ht(xi), and (Dt(i)/Zt) × e^(αt) if yi ≠ ht(xi),

where Zt is a normalization factor and

αt = (1/2) ln((1 − εt)/εt) > 0

- Final classifier:

H(x) = sign(Σt αt ht(x))
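The update rule above can be exercised numerically. A small Python sketch of one boosting round (illustrative; a fixed weak classifier’s predictions stand in for the learned ht):

```python
import math

def adaboost_round(weights, preds, labels):
    """One AdaBoost weight update (Freund & Schapire, 1997).
    weights: current distribution D_t over examples (sums to 1);
    preds/labels: weak-classifier outputs and true labels in {-1, +1}."""
    eps = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    alpha = 0.5 * math.log((1 - eps) / eps)
    new = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, preds, labels)]
    z = sum(new)                      # normalization factor Z_t
    return [w / z for w in new], alpha

D = [0.25] * 4
preds  = [+1, +1, -1, -1]
labels = [+1, -1, -1, -1]            # one mistake -> eps = 0.25
D2, alpha = adaboost_round(D, preds, labels)
print(round(alpha, 4))               # 0.5493, i.e. 0.5 * ln(3)
print(round(D2[1], 6))               # 0.5
```

Note the classic property visible here: after the update, the misclassified examples carry total weight 1/2, which is what forces the next weak learner to concentrate on them.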
Gradient boosted trees [Friedman, 2001]
- Stochastic gradient descent optimization of the loss function
- A decision tree model as the weak classifier
- Does not require feature normalization
- No need to handle missing values specifically
- Reported good performance in relevance prediction problems [Piwowarski et al., 2009], [Hassan et al., 2010] and [Gulin et al., 2011]
Gradient boosted trees: gbm R package implementation
- Two distributions are available for classification tasks: Bernoulli and AdaBoost
- Three basic parameters: interaction depth (depth of each tree), number of trees (or iterations) and shrinkage (learning rate)
Logistic regression: glm, stats R package
- Preprocess the initial training data: impute missing values with the help of bagged trees
- Fit the generalized linear model:

f(x) = 1 / (1 + e^(−z)), where z = β0 + β1x1 + · · · + βkxk
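A minimal sketch of the fitted model’s prediction step (Python for illustration; the coefficients below are made up, not fitted values):

```python
import math

def sigmoid(z):
    """Logistic link: maps the linear predictor z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict(beta, x):
    """f(x) = sigmoid(beta0 + beta1*x1 + ... + betak*xk).
    beta[0] is the intercept; remaining entries pair with features x."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return sigmoid(z)

print(sigmoid(0.0))                         # 0.5
print(logistic_predict([0.0, 1.0], [2.0]))  # sigmoid(2) ~ 0.8808
```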
Tuning gbm: bernoulli model
3-fold CV estimate of AUC for the optimal parameters: 0.6457435
Tuning gbm: adaboost model
3-fold CV estimate of AUC for the optimal parameters: 0.6455384
Comparative performance of the three optimal models
Test error estimates:

Model                Optimal parameter values                           Test estimate of AUC
gbmbernoulli         interaction.depth=2, n.trees=500, shrinkage=0.01   0.6324717
gbmadaboost          interaction.depth=4, n.trees=700, shrinkage=0.01   0.6313393
logistic regression  -                                                  0.618648
Contest statistics
- 101 participants, 84 of them eligible for prizes
- Two-stage evaluation procedure: a validation set and a test set (their sizes were unknown during the contest)
- Validation set size ≈ 11,000 instances
- Test set size ≈ 20,000 instances
Preliminary Results (validation set)
19th place (AUC = 0.650004)
Final Results (test set)
34th place (AUC = 0.643346)

#   Team          AUC
1   cointegral*   0.667362
2   Evlampiy*     0.66506
3   alsafr*       0.664527
4   alexeigor*    0.663169
5   keinorhasen   0.660982
6   mmp           0.659914
7   Cutter*       0.659452
8   S-n-D         0.658103
…   …             …
34  CLL           0.643346
…   …             …
Acknowledgements
We would like to thank:
I the organizers from Yandex for an exciting challenge
I E.L. Stolov, V.Y. Mikhailov, V.D. Solovyev and other colleaguesfrom Kazan Federal University for fruitful discussions and support
References I
[Freund & Schapire, 1997] Freund, Y., Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting // Journal of Computer and System Sciences. – V. 55. – No. 1. – 1997. – P. 119–139.

[Friedman, 2001] Friedman, J. Greedy Function Approximation: A Gradient Boosting Machine // Annals of Statistics. – V. 29. – No. 5. – 2001. – P. 1189–1232.

[Gulin et al., 2011] Gulin, A., Kuralenok, I., Pavlov, D. Winning The Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank // JMLR: Workshop and Conference Proceedings. – 2011. – P. 63–76.

[Hassan et al., 2010] Hassan, A., Jones, R., Klinkner, K.L. Beyond DCG: User behavior as a predictor of a successful search // Proceedings of the Third ACM International Conference on Web Search and Data Mining. – ACM. – 2010. – P. 221–230.
References II
[Piwowarski et al., 2009] Piwowarski, B., Dupret, G., Jones, R. Mining User Web Search Activity with Layered Bayesian Networks or How to Capture a Click in its Context // Proceedings of the Second ACM International Conference on Web Search and Data Mining. – ACM. – 2009. – P. 162–171.

[Schapire, 1990] Schapire, R. The strength of weak learnability // Machine Learning. – V. 5. – No. 2. – 1990. – P. 197–227.
Compute AUC for a gbm model

ComputeAUC <- function(fit, ntrees, testSet) {
  require(ROCR)
  require(foreach)
  require(gbm)
  # Drop identifier and label columns before prediction
  pureTestSet <- subset(testSet, select = -c(QueryID, RegionID, URLID, RelevanceLabel))
  queryRegions <- unique(subset(testSet, select = c(QueryID, RegionID)))
  count <- nrow(queryRegions)
  # Per-group AUC over every (query, region) pair in the test set
  aucValues <- foreach(i = 1:count, .combine = "c") %do% {
    queryId <- queryRegions[i, "QueryID"]
    regionId <- queryRegions[i, "RegionID"]
    inGroup <- testSet$QueryID == queryId & testSet$RegionID == regionId
    true.labels <- testSet[inGroup, ]$RelevanceLabel
    m <- mean(true.labels)
    if (m == 0 | m == 1) {
      # AUC is undefined when the group contains only one class
      curAUC <- NA
    } else {
      gbm.predictions <- predict.gbm(fit, pureTestSet[inGroup, ],
                                     n.trees = ntrees, type = "response")
      pred <- prediction(gbm.predictions, true.labels)
      perf <- performance(pred, "auc")
      curAUC <- perf@y.values[[1]][1]
    }
    curAUC
  }
  return(mean(aucValues, na.rm = TRUE))
}
Tuning AUC for a gbm model

TuningGbmFit <- function(trainSet, foldsNum = 3, interactionDepth = 4,
                         minNumTrees = 100, maxNumTrees = 1500, step = 100,
                         shrinkage = .01, distribution = "bernoulli",
                         aucfunction = ComputeAUC) {
  require(gbm)
  require(foreach)
  require(caret)
  require(sqldf)
  FUN <- match.fun(aucfunction)
  ntreesSeq <- seq(from = minNumTrees, to = maxNumTrees, by = step)
  # Folds over queries; returnTrain = TRUE yields training-set indices
  folds <- createFolds(trainSet$QueryID, foldsNum, list = TRUE, returnTrain = TRUE)
  aucvalues <- foreach(i = 1:length(folds), .combine = "rbind") %do% {
    inTrain <- folds[[i]]
    cvTrainData <- trainSet[inTrain, ]
    cvTestData <- trainSet[-inTrain, ]
    # Drop identifier columns before fitting
    pureCvTrainData <- subset(cvTrainData, select = -c(QueryID, RegionID, URLID))
    gbmFit <- gbm(formula = formula(pureCvTrainData), data = pureCvTrainData,
                  distribution = distribution, interaction.depth = interactionDepth,
                  n.trees = maxNumTrees, shrinkage = shrinkage)
    # Evaluate the fitted model at every candidate number of trees
    foreach(n = ntreesSeq, .combine = "rbind") %do% {
      auc <- FUN(gbmFit, n, cvTestData)
      c(n, auc)
    }
  }
  aucvalues <- as.data.frame(aucvalues)
  # Average the CV AUC over folds for each number of trees
  avgAuc <- sqldf("select V1 as ntrees, avg(V2) as AvgAUC from aucvalues group by V1")
  return(avgAuc)
}