
Data Mining Techniques - Assignment 2

Authors: Hidde Hovenkamp (2541936) and Dennis Ramondt (2540351)

Vrije Universiteit, Amsterdam


1 Introduction

This paper presents the approach, results and learning process of group 6's participation in the Data Mining Techniques class competition. The challenge is part of a now-closed Kaggle competition on learning to rank hotels so as to maximise bookings for hotel queries on Expedia.com. The dataset consists of search and hotel ID pairs, populated with hotel characteristics such as displayed booking price, location attractiveness and star rating. A search-hotel pair is assigned a relevance score of 1 if it has only been clicked by the user, 5 if it has also been booked, and 0 otherwise. This score is used to calculate the Normalized Discounted Cumulative Gain (NDCG), used for the hotel ranking and final evaluation of participating teams. The defining characteristic of our approach is that we decided to implement a ranking algorithm package called RankLib, which allowed us to focus our attention on feature creation and selection.
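For reference, the evaluation metric truncates the ranking at position 38 and, with $rel_i$ denoting the relevance score of the hotel at rank $i$, follows the standard definition

$$\mathrm{DCG@38} = \sum_{i=1}^{38} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@38} = \frac{\mathrm{DCG@38}}{\mathrm{IDCG@38}},$$

where IDCG@38 is the DCG@38 of the ideal ordering, so that a perfect ranking scores 1.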

This paper is structured as follows. Section 2 consists of data exploration and preprocessing, during which we discuss important properties of the dataset and select and create relevant hotel features. Section 3 explains the feature selection. Section 4 explains the modelling procedure and which ranking algorithms were chosen. Sections 4 and 5 also give special attention to the approaches taken by teams in the Kaggle competition. Section 5 presents our results, Section 6 draws conclusions and evaluates the modelling process, and Section 7 is the process report, describing how we worked together, how we divided tasks and what could be improved.

Fig. 1. Share of missing values per feature, shown only for features that contained missing values


2 Data Preparation

2.1 Exploration

The training set consists of 4,958,347 search-hotel pairs with 199,549 unique searches and 51 features; the test set contains 4,959,183 search-hotel pairs with 199,795 unique searches and 47 features. An initial inspection of the data reveals some interesting properties. First of all, as visible in Figure 1, several features consist largely of missing values. Furthermore, the dataset contains only 4.47% positive outcomes (a relevance score of 1 or 5), which could cause certain ranking algorithms to train mostly for negative outcomes.

Fig. 2. Click and book percentages when ranked randomly or by Expedia’s own ranking algorithm.

Figure 2 shows the percentage of entries that were clicked or booked as a function of the ranking position, where the ranking was computed either randomly or by Expedia's own algorithm. It clearly shows that the likelihood of a hotel being booked after it has been clicked is a lot higher when Expedia provides the ranking. Finally, histogram plots of three numerical features in Figure 3 show that they have highly skewed distributions with extreme outliers.

2.2 Feature Creation

The challenge of creating a good feature set is to select and create features that are expected to be highly correlated with the relevance scores in both the training and test sets. Overall, publications, discussions on the Kaggle forum and team presentations suggest that the price, the second location score and the destination ID are the strongest predictors [3]. Table 1 shows an overview of the transformed and composite features we created, the rationale behind which is discussed below. We used a logistic regression on the outcome variable (booking) to assess each feature's relevance.

Missing value imputation and outliers As described, many features contain a significant number of missing values. In many cases, a missing value is information in itself; when a hotel has no previous reviews, we take this missing information as something negative and impute a value of zero. Missing review scores, second location scores and search log query scores were therefore set to zero; other features were imputed with the median. Furthermore, as Figure 3 showed, the numerical features contain extreme outliers above the 0.999 quantile, which were deleted.
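As an illustration, a minimal MATLAB sketch of this imputation and outlier removal, assuming the data has been read into a table train with the Kaggle column names (quantile is from the Statistics Toolbox):

    % Missing review and location scores carry negative information: impute zero
    train.prop_review_score(isnan(train.prop_review_score)) = 0;
    train.prop_location_score2(isnan(train.prop_location_score2)) = 0;
    train.srch_query_affinity_score(isnan(train.srch_query_affinity_score)) = 0;

    % Other numerical features are imputed with their median
    d = train.orig_destination_distance;
    train.orig_destination_distance(isnan(d)) = median(d, 'omitnan');

    % Delete rows above the 0.999 quantile of a heavily skewed feature
    train = train(train.price_usd <= quantile(train.price_usd, 0.999), :);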

3

Page 4: DM Assignment 2 - Group 6

Fig. 3. Histogram plots of three relevant numerical features, with 0.999 quantiles indicated, above which data points were deleted as outliers.

Monotonic utility Various numerical features have preference profiles in the shape of a peaked distribution, implying some optimal prediction value. However, the team of Jun and Wang (second place in the competition) rightfully proposed to construct features with monotonically increasing utility with respect to the target variable, i.e. where a higher feature score implies a higher chance of being booked. Such a transformation can be achieved by taking the absolute value of a feature after subtracting its mean, where the mean is computed over bookings only. Figure 4 shows histograms of booking frequencies from which it can be seen that certain features are monotonic in this sense and others less so.
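A minimal MATLAB sketch of this transformation, again assuming a table train with the Kaggle column names:

    % Mean star rating and review score over booked entries only
    mu_star   = mean(train.prop_starrating(train.booking_bool == 1));
    mu_review = mean(train.prop_review_score(train.booking_bool == 1));

    % Monotonic versions: distance to the booked-population mean, so that
    % utility is monotone in the transformed feature
    train.star_monotonic   = abs(train.prop_starrating   - mu_star);
    train.review_monotonic = abs(train.prop_review_score - mu_review);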

Fig. 4. Histogram plots of booking frequencies for several features.

Normalisation Furthermore, Owen (first place in the competition) explains that some numerical features need to be normalised by subtracting a subgroup average. For example, certain searches may contain proportionally better or worse hotels, which puts those hotels at an advantage or disadvantage with respect to hotels in other searches.
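In MATLAB, this de-meaning per search can be written with findgroups and splitapply (a sketch; the same pattern applies per destination ID):

    % Normalise price within each search: subtract the per-search mean
    g  = findgroups(train.srch_id);               % group number per row
    mu = splitapply(@mean, train.price_usd, g);   % one mean per search
    train.price_norm_srch = train.price_usd - mu(g);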

4

Page 5: DM Assignment 2 - Group 6

Composite Features Finally, relevant new features can be created by combining existing ones. For example, the proportion of times each hotel has been clicked and booked after appearing in search results is a good indicator of hotel quality (to be calculated only on the training set). Hotel rankings within search or destination IDs based on feature values are also expected to be relevant; we implemented our own algorithm to create such rankings for several numerical features. Finally, two features were added indicating whether a competitor offered a cheaper booking than Expedia and what the price difference was. Other Kaggle teams indicated that the former variable was useful, but that the latter proved less significant. Table 1 shows the formulas used for creating our composite features; a MATLAB sketch of two of them follows the table. Some of these features are also shown in Figure 4, where the distribution of bookings within these features is depicted. It is interesting to see that, indeed, for the difference features, values close to zero correspond to many more bookings.

Feature              Formula                                               Indexing
Hotel quality        relevance score - mean(relevance score)_i            i = Search ID
Star diff            abs(visitor_hist_starrating - prop_starrating)
Price diff           abs(visitor_hist_adr_usd - price_usd)
Price hist diff      abs(prop_log_historical_price - log(price_usd))
Comp cheap           [1 | (comp_rate_i < price_usd AND comp_inf_i = 1)]   Competitor i = 1:8
Comp cheap diff      [max(comp_rate_i) | comp_cheap = 1]                  Competitor i = 1:8
Star monotonic       abs(prop_star - mean(prop_star[booking_bool]))
Review monotonic     abs(prop_review - mean(prop_review[booking_bool]))
Feature Ranked       rank(feature)_i                                      i = Search ID
Feature Mean         mean(feature)_i                                      i = Search/Destination ID
Feature Normalized   feature - mean(feature)_i                            i = Search/Destination ID

Table 1. The various transformed and composite features and their formulas. Where applicable, formulas were applied over specific feature categories through indexing; ranking and normalization were implemented over several numerical features.
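As an illustration, a MATLAB sketch of two of these composite features: hotel quality and the within-search price rank. The derived relevance column and table variable train are assumptions, tiedrank requires the Statistics Toolbox, and the loop is written for clarity rather than speed:

    % Hotel quality: mean relevance of each property over the training set
    gp = findgroups(train.prop_id);
    q  = splitapply(@mean, train.relevance, gp);   % relevance: 5, 1 or 0
    train.hotel_quality = q(gp);

    % Rank of the price within each search (1 = cheapest hotel shown)
    train.price_rank = zeros(height(train), 1);
    for s = unique(train.srch_id)'
        idx = train.srch_id == s;
        train.price_rank(idx) = tiedrank(train.price_usd(idx));
    end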

3 Feature Selection

To get a first indication of the relative importance of our features we use a logistic regression of the features on whether a property was booked. We start with a set of independent variables that includes all features. From there, we delete insignificant features one by one (at the 5% significance level) and re-run the logistic regression. We keep doing this until we are left with a set of features that all have a significant effect on booking; a sketch of this elimination loop follows Table 2. Table 2 shows the parameter estimates for the final set of features left in the logistic regression. The results provide an indication of which features have large predictive power for booking. Interestingly, many of the features normalised over search ID (ID) and destination ID (DEST) appear to be important. Additionally, the mean star rating and mean review score per search ID are also good predictors, and the numerical features ranked within a search ID also seem to be very good features. While star difference and price difference are also included in the final set, comp. cheap difference and price historical difference seem to be less important, as they were not significant. Above all, our hotel quality feature is the strongest predictor of booking, which is what we had expected. In general we find the signs of the parameters in the direction we would expect. A few have the opposite sign, but these features enter the set in multiple forms, which probably means they interact with each other.

5

Page 6: DM Assignment 2 - Group 6

Feature                            Parameter Estimate    Feature                    Parameter Estimate
intercept                          -4.9168               rank - star rating          0.0180
meanID - star rating               -0.0684               rank - comp. cheap diff    -0.0158
meanID - review                     0.4096               rank - location score 1    -0.0199
normalisedID - star rating          0.2851               rank - location score 2     0.0263
normalisedID - review               0.4200               rank - price               -0.0735
normalisedID - location score 2     1.9844               star diff.                  0.0751
normalisedDEST - star rating        0.0222               price diff.                -0.0020
normalisedDEST - review            -0.2323               location score 2           -0.9035
normalisedDEST - location score 2   1.2601               hotel quality              12.7953
location score 1                    0.0883

Table 2. Final result from the logistic regression on booking. The parameter estimates give an indication of the relative importance of the features.
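A sketch of this backward elimination in MATLAB, using fitglm (Statistics Toolbox) for the logistic regression; featureNames (a cell array with the candidate columns) and the numeric coding of the predictors are assumptions:

    % Backward elimination at the 5% level on a logistic regression of booking
    tbl = train(:, [featureNames, {'booking_bool'}]);
    while true
        mdl = fitglm(tbl, 'ResponseVar', 'booking_bool', ...
                     'Distribution', 'binomial');
        p = mdl.Coefficients.pValue(2:end);        % skip the intercept
        [pmax, i] = max(p);
        if pmax <= 0.05, break; end                % everything left is significant
        tbl(:, mdl.CoefficientNames{i + 1}) = [];  % drop least significant feature
    end
    disp(mdl.Coefficients)                         % final parameter estimates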

4 Modelling Approach

We implemented a step-wise modelling approach to the Expedia hotel ranking problem, on which we now elaborate. First, we used a logistic regression to determine which features seem most relevant. Second, we evaluated various ranking algorithms on 5% of our training set. Third, we looked at some built-in normalisation procedures. Finally, we trained our chosen model on 10% of the training data set in order to make our prediction.

4.1 Ranking models

For ranking problems, we found that most competitors in the Kaggle competition used a very efficient package in

Java: RankLib, which includes several algorithms made specifically for learning to rank. RankLib consists of the

following algorithms: RankNet, RankBoost, AdaRank, Coordinate Ascent, Random Forest, ListNet, MART and

LambdaMART. In what follows, we explain how these algorithms work, which we expect to perform best and why.

RankNet is a pair-wise ranking algorithm based on neural networks. For each pair of correctly ranked documents, each document is propagated through the net separately [2]. The difference between the two outputs is then mapped through a logistic function to obtain a probability, which is compared with the true label for that pair. Finally, all weights in the network are updated with error backpropagation and a gradient descent method.

RankBoost also uses a pair-wise boosting technique, where training proceeds in rounds [2]. All documents start with equal weights, and each round the learner selects the weak ranker with the smallest pair-wise loss on the training data. Pairs that are incorrectly ranked obtain more weight, such that the algorithm focuses on these pairs in the next round. The final model then consists of a linear combination of these weak rankers.

AdaRank works in essentially the same way as RankBoost, except that it is list-wise rather than pair-wise. The advantage is that it directly maximizes any information retrieval metric, such as NDCG in our case. This could prove to be an advantage over RankBoost for our purposes.

While coordinate ascent is often used for unconstrained optimization, Metzler and Croft proposed a version of the algorithm for information retrieval [4]. It cycles through each parameter and optimizes over it while keeping all other parameters fixed. When implemented in a list-wise linear model, this technique can be used for ranking.

6

Page 7: DM Assignment 2 - Group 6

A Random Forest is an ensemble of decision trees. Since single decision trees are likely to overfit when made too big, but underfit when made too small, averaging over a set of decision trees can balance out these effects. The method is very efficient since there are very few parameters to tune.

ListNet is a list-wise learning method that, like RankNet, optimizes a loss function with a neural network as its model and gradient descent as its algorithm [1]. Instead of using document pairs as instances, it uses lists of documents.

MART (multiple additive regression trees) is a gradient boosted tree model [1]. In a sense, it is more a class of models than a single algorithm. The underlying model for our MART is the least squares regression tree, and it uses gradient descent as its optimization algorithm.

Last, the best model found in the literature for ranking is often claimed to be LambdaMART. It is a combination of LambdaRank (an improved version of RankNet) and MART: it uses LambdaRank to model the gradients and MART to work on these gradients [1]. Combined, we obtain a MART model that uses the Newton-Raphson method for approximation. The decisions for splits at a certain node are computed using all the data that falls into that node. This makes it able to choose splits and leaf values that may decrease local utility but increase overall utility [1].

For our initial assessment of which ranking model performs best on the Expedia data set, we split our data into three different sets: the training set, the validation set and our own test set. Since the total training data set contained almost 5 million rows, we sampled a subset from it to use for our model building. First, we randomly sampled 10,000 search IDs from the entire training dataset, which amounts to approximately 200,000 rows. Of this, 75% is used as the actual training data and 25% is used by the ranking algorithms for validation. Second, we sampled another 10,000 search IDs from the training set, which we keep entirely separate and use as our own test set.
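RankLib is driven from the command line, so we exported these sets from MATLAB in the LETOR/SVMLight text format RankLib reads and invoked it via system(). A sketch, with our own (assumed) file names; ranker 6 is LambdaMART in RankLib's numbering:

    % Each exported line reads: <relevance> qid:<srch_id> 1:<f1> 2:<f2> ...
    cmd = ['java -jar RankLib.jar' ...
           ' -train train.txt -validate vali.txt -test test.txt' ...
           ' -ranker 6 -metric2t NDCG@38 -metric2T NDCG@38' ...
           ' -save lambdamart.model'];
    status = system(cmd);   % 0 on success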

4.2 Normalization procedure

Having determined the best model, we also test whether several normalization procedures improve the results. We try the following methods: normalization by sum, z-score normalization and linear normalization:

$$x_{\mathrm{sum}} = \frac{x}{\sum x}, \qquad x_{z} = \frac{x - \mu}{\sigma}, \qquad x_{\mathrm{linear}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
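These three procedures correspond to RankLib's built-in -norm option, so they can be switched on without extra preprocessing (flag values as documented for RankLib):

    % Append to the training command from Section 4.1:
    %   -norm sum      normalization by sum
    %   -norm zscore   z-score normalization
    %   -norm linear   linear (min-max) normalization
    cmd = [cmd ' -norm zscore'];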

4.3 Final prediction model

Once we have chosen the best ranking model and normalization procedure, we train our final model on a training set of 20,000 search IDs, which amounts to roughly 500,000 rows. We also tweak the parameters to find the optimal parameter settings for our ranking problem. We use this model to create our final prediction on the provided test set.
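Once trained, the saved model is applied to the provided test set with RankLib's -load/-rank/-score options, after which we sort hotels per search by descending score in MATLAB to build the submission file. A sketch, with assumed file names:

    % Score the provided test set with the trained model
    system(['java -jar RankLib.jar -load lambdamart.model' ...
            ' -rank kaggle_test.txt -score scores.txt']);

    % scores.txt holds one line per row: <qid> <doc index> <score>
    S = readtable('scores.txt', 'FileType', 'text', 'ReadVariableNames', false);
    S.Properties.VariableNames = {'qid', 'doc', 'score'};
    S = sortrows(S, {'qid', 'score'}, {'ascend', 'descend'});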

5 Results

First we train the model using 9 different algorithms from the Java implementation RankLib. We train the models on roughly 5% of the data and test the results on an equal amount. We start with the default settings in RankLib for all the models to get a feeling for which class of models performs best for our ranking problem. Table 3 shows the results for the training set, validation set and test set. From the table we can see that LambdaMART is the best performing model, followed by MART and Random Forest.^1 The two neural network models RankNet and ListNet perform very poorly. For almost all models the results on the training and validation sets are higher than on the test set, which means we are slightly overfitting.

Model               Training Set   Validation Set   Test Set
MART                0.5547         0.5355           0.4868
RankBoost           0.4877         0.4727           0.4531
AdaRank             0.5094         0.5066           0.4612
RankNet             0.3497         0.3424           0.3495
Coordinate Ascent   0.5127         0.5105           0.4641
LambdaRank          0.3498         0.3389           0.3502
LambdaMART          0.5627         0.5409           0.4920
ListNet             0.3498         0.3389           0.3502
Random Forest       0.5595         0.5242           0.4813

Table 3. Results for 9 ranking models on the training, validation and test data, measured in NDCG@38. Training data consists of 7,500 random search IDs, the validation set of 2,500 search IDs and the test set of 10,000 search IDs.

In line with reports from previous winners of the Kaggle competition, we also find that LambdaMART performs best for this ranking problem. Next we evaluate whether normalizing the entire feature set, using different procedures, further improves the model. Table 4 shows the NDCG scores for the model without normalization and with sum, z-score and linear normalization (as described in the previous section). Interestingly, we find that the model with no normalization performs best on the test set, although linear normalization comes very close and outperforms the rest on the training and validation sets.

Normalization         Training Set   Validation Set   Test Set
LambdaMART - none     0.5627         0.5409           0.4920
LambdaMART - sum      0.5659         0.5396           0.4892
LambdaMART - zscore   0.5564         0.5399           0.4890
LambdaMART - linear   0.5675         0.5478           0.4915

Table 4. Results for different normalization procedures on the training, validation and test data, measured in NDCG@38. Training data consists of 7,500 random search IDs, the validation set of 2,500 search IDs and the test set of 10,000 search IDs.

Finally, we investigated whether we could further fine-tune our LambdaMART model with no normalization to improve on our score. We checked whether increasing the number of trees from 1000 to 2000 or 3000 would improve the model, but the scores were exactly the same. Using the optimal model specifications found so far, we ran a final LambdaMART model on a training set of 40,000 search IDs, and found 0.5659 on the training set, 0.5505 on the validation set and 0.4977 on the test set.^2 We also performed a five-fold cross-validation on this final model, for which the results can be found in Table 5; a sketch of the corresponding RankLib call follows the table. Fold 1 seems to perform best, based on its test set. Making a prediction with this model is expected to lead to a slightly higher NDCG score.

^1 Although LambdaMART has the highest NDCG scores in the table, our prediction was made with a MART model. This is because there was an error in our LambdaMART implementation, which we only found after the deadline for handing in our prediction had passed.

^2 40,000 search IDs was the maximum number possible, given our computing power constraints.


Cross-validation fold   Training Set   Validation Set   Fold Test Set
LambdaMART - fold 1     0.5704         0.5354           0.5460
LambdaMART - fold 2     0.5644         0.5340           0.5457
LambdaMART - fold 3     0.5752         0.5368           0.5383
LambdaMART - fold 4     0.5693         0.5339           0.5451
LambdaMART - fold 5     0.5770         0.5551           0.5331
Average                 0.5713         0.5382           0.5416

Table 5. Results for five-fold cross-validation on the training and validation data, measured in NDCG@38. Training data consists of 7,500 random search IDs and the validation set of 2,500 search IDs. The fold test error is the error on a separate test set held apart in the cross-validation, calculated after completing each fold.
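The cross-validation itself can be delegated to RankLib's -kcv option, a sketch of which is:

    % Five-fold cross-validation of LambdaMART on the final training file
    system(['java -jar RankLib.jar -train final_train.txt' ...
            ' -ranker 6 -kcv 5 -metric2t NDCG@38']);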

6 Conclusion

6.1 Summary of main findings

The main conclusion that can be drawn from our results is that LambdaMART is the best model for learning to

rank hotels such that bookings are maximised. This conclusion is in line with what we find in the literature and

with the top performing teams in the Kaggle competition. Of the original features, the winning team stated that

the second location score, the price and the ranking position were the most important features. Our combination of

logistic regression and incremental model adjustments mostly agreed with this, although the rank position did not

work out as well, and instead pointed to the review score as relevant. Of our composite features, the hotel quality,

difference features, normalised and mean features and value rankings proved significant predictors. Overall, what

the analysis shows is that by far the most value lies in the transformed and composite features. This suggests that

we were right to focus on the feature creation and selection process, and could even have experimented with many

more new features.

6.2 Suggestions for further improvement

Although we obtained a relatively powerful model for predicting the ranking of hotels on Expedia, there are several suggestions for further improvement on which we would like to briefly elaborate. First, although we tried to create a balanced dataset with roughly equal amounts of negative and positive outcomes, we did not succeed in doing this properly. We think doing so could really improve the model, as most of the winners emphasize its importance. Second, all our feature engineering was performed only on the training set. However, it would have been better to combine the training and test sets and create the ranking features, de-meaned features and means per property ID and destination on the entire dataset. Third, we tried to create an extra feature measuring the average position of a hotel over all search queries. It did not significantly improve the model, but we still think this could be a very important feature, so we suspect something might have gone wrong in computing it; we would therefore suggest further investigating the potential of this variable. Last, on a more practical note, we had some difficulties with the enormous dataset for this assignment. Due to computational constraints on our MacBooks (running out of memory) we were only able to train our models on at most 10% of the dataset. Our scores would probably have improved somewhat if we could have trained on the entire training set, for example by making use of external computing power.


6.3 Process Evaluation

Looking back at the process, there are a couple of things we would do differently next time. First, it would have been better if we had spent more time in the beginning exploring which model would be best for this problem and what would be the best software to implement it in. This would have prevented us from working on programming models that we were not able to use in the end. Second, we should have started with a much simpler model, including just a few features, and made the prediction work with this model first. Because we lost some time figuring out how the Java package worked, we had already created a rich set of features by the time we put them into our model. This made it very difficult to find the small errors and mistakes we had made, and caused our predictions to be very bad for a long time. If we had started with a simple model first, to make sure everything worked properly, we probably would have had more time at the end to further improve the model. This is also why we made our prediction with a MART model: LambdaMART would have improved on it, but we only found the mistake after the deadline for handing in the prediction file.

References

1. Christopher J. C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Learning, 11:23–581, 2010.

2. V. Dang and W. Bruce Croft. Feature Selection for Document Ranking using Best First Search and Coordinate Ascent. In SIGIR Workshop on Feature Generation and Selection for Information Retrieval, 2010.

3. Xudong Liu, Bing Xu, Yuyu Zhang, Qiang Yan, Liang Pang, Qiang Li, Hanxiao Sun, and Bin Wang. Combination of Diverse Ranking Models for Personalized Expedia Hotel Searches. arXiv preprint arXiv:1311.7679, 2013.

4. Donald Metzler and W. Bruce Croft. Linear Feature-Based Models for Information Retrieval. Information Retrieval, 10(3):257–274, 2007.


7 Process Report

Our group consists of Dennis and Hidde who are both studying econometrics, while Dennis also followed the course

on Neural Networks. We both have by far the most programming experience in MATLAB. We have worked on this

assignment for a total of five weeks, generally working on it together at the same time. First we

will go through what we did week by week and then we will reflect on the process and what we might do differently

next time.

In week 1 we received the assignment and started with some initial exploration. The task of

ranking hotels for Expedia seemed like a very interesting and practical application of data mining to us. Hidde read

through the reports written by the winners of the Kaggle competition to get a feeling for well-working algorithms

and to understand the task at hand better. Hidde also went through the slides of the previous lectures to see which

techniques would be useful for this assignment. At the same time Dennis started examining the data to see how we

could import it into MATLAB. He also looked into the possibilities of using his previous experience in neural

network models for this assignment.

In week 2 we started thinking about the first models we could use for this problem and made a first attempt

at programming these in MATLAB. While Hidde worked on an implementation for Random Forests, Dennis tried

to train a multilayer perceptron (MLP) on a small part of the dataset. We started with the modelling part before

feature preparation and selection to get a feeling for how complicated it was going to be to use these models for

prediction and what the computing time would be. We soon discovered that building these models ourselves took a lot of time, and that computing times in MATLAB were very long. Therefore, we decided to change our approach. Instead of

trying to build a model from the knowledge we had, we looked for implementation packages of some of the ranking

models used by the winners of the Kaggle competition such as (Lambda)MART. We found a very good package

written in Java, RankLib, that most of the well-performing teams in the Kaggle competition had used. Although we had no experience programming in Java, we decided to go with this package so that, once we understood it and were able to use it, we could focus on feature building and selection. This seemed like the most

important driver for getting a good score.

In week 3 we focused first on understanding the RankLib package, so we would be sure that we would be able

to hand in a prediction on May 17. While Dennis familiarised himself with Java and figured out the technical part

of getting RankLib to work, Hidde investigated the various algorithms in the package to understand how they

worked and which might work best. Hidde also continued with the data preparation, which we decided to keep

doing in MATLAB. We would then export the dataset to Java to train our models and import back into MATLAB

to determine the final ranking and create the prediction file. While we worked on being able to train our first

model in Java we now also started working intensively on missing data, removing outliers, creating new features

and transforming existing features.

By the time we reached week 4 we finally had our Java program working and we could train our first models.

We compared the various ranking models in RankLib on a small training data set and quickly found MART was

rather fast and also performed well. Dennis worked on creating ranked numerical features within search queries and

Hidde combined features such as competitive prices together to make more powerful features. Dennis also wrote


the code in MATLAB for drawing a random sample of roughly 5% of the training data on which we trained our

models. Hidde trained a variety of models in Java and tested on our own test set, which also consisted of 5% of the

training set. However, we experienced a lot of trouble training a proper model, because we spent a lot of time debugging our Java implementation. We decided to verify whether our model was able to make a proper prediction

by predicting the Kaggle test set and uploading our prediction to check the score. Finally, we managed to get a

score on Kaggle of 0.48457, which would have been good for 100th place. By this point, however, we had very little time

left to fine-tune our model or train it on a large dataset, because we had to hand in our prediction.

Week 5 was the week of the final lecture in which we spent time on preparing the presentation and mostly

worked on the final report. Dennis wrote the parts on data exploration and feature building, while Hidde elaborated

on the modelling approach. Together we wrote the introduction and conclusion and finished the report.

Reflecting on our cooperation as a team, the collaboration between Dennis and Hidde was very good. Since we know each other very well, it was easy to work together and use each other's strengths. While Hidde has a somewhat

stronger theoretical background in econometrics at the VU, Dennis was able to use his programming experience

from the neural networks course. Although we both had very busy schedules we managed to work on the assignment

on a regular basis, which was helped by the fact that we live in the same house. We chose deliberately to work in

a team of two rather than three, because we are both so busy and it would have been hard to find proper meeting

times with a third person. However, the downside was that we had to do more of the work ourselves instead of being

able to divide it among three people. Nevertheless, we think the time we gained by working in a team that knows each

other very well outweighs this extra work. A possible pitfall of knowing each other well is that you might overlook

some mistakes or opportunities because you have too similar a way of thinking.

Overall, we can look back at an interesting and very practical course on data mining in which we learned a lot

about different methods but also about the data mining process itself. The very practical application of the Kaggle

assignments definitely made data mining come to life for us and contributed strongly to our enthusiasm. Although we were frustrated at times during the process, we both finish the course with a lot of new knowledge, experience and satisfaction.
