An Introduction to Collaborative Filtering with Apache Mahout

Sebastian Schelter

Recommender Systems Challenge at ACM RecSys 2012

13.09.2012

Database Systems and Information Management Group (DIMA), Technische Universität Berlin

http://www.dima.tu-berlin.de/

Page 2: Introduction to Collaborative Filtering with Apache Mahout

Overview

■ Apache Mahout: an Apache-licensed library that aims to provide highly scalable data mining and machine learning

■ its collaborative filtering module is based on Sean Owen's Taste framework

■ mostly aimed at production scenarios, with a focus on

□ processing efficiency

□ integration with different datastores, web applications, and Amazon EC2

□ scalability: recommendations, item similarities and matrix decompositions can be computed via MapReduce on Apache Hadoop

■ not used much in recommender challenges

□ not enough different algorithms implemented?

□ not enough tooling for evaluation?

→ it's open source, so it's up to you to change that!

Page 3: Introduction to Collaborative Filtering with Apache Mahout

Preference & DataModel

■ Preference encapsulates a user-item interaction as a (user, item, value) triple

□ only numeric userIDs and itemIDs are allowed, for memory efficiency

□ PreferenceArray encapsulates a set of preferences

■ DataModel encapsulates a dataset

□ lots of convenient accessor methods like getNumUsers(), getPreferencesForItem(itemID), ...

□ allows temporal information to be attached to preferences

□ lots of options for storing the data (in-memory, file, database, key-value store)

□ drawback: for many use cases, all the data has to fit into memory to allow efficient recommendation

DataModel dataModel = new FileDataModel(new File("movielens.csv"));

PreferenceArray prefsOfUser1 = dataModel.getPreferencesFromUser(1);
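
To make the memory argument concrete, here is a minimal plain-Java sketch of storing one user's preferences in parallel primitive arrays rather than one object per preference. It is not Mahout's implementation; the class name CompactPrefs and all of its methods are hypothetical and only illustrate why numeric IDs are cheap to hold in memory.

```java
/** Hypothetical, simplified stand-in for a per-user preference array. */
final class CompactPrefs {
    private final long userID;
    private final long[] itemIDs;  // primitive IDs: 8 bytes each, no boxing
    private final float[] values;  // primitive preference values: 4 bytes each

    CompactPrefs(long userID, long[] itemIDs, float[] values) {
        if (itemIDs.length != values.length) {
            throw new IllegalArgumentException("itemIDs and values must have equal length");
        }
        this.userID = userID;
        this.itemIDs = itemIDs.clone();
        this.values = values.clone();
    }

    long getUserID() { return userID; }

    int length() { return itemIDs.length; }

    long getItemID(int i) { return itemIDs[i]; }

    float getValue(int i) { return values[i]; }

    /** Returns the stored value for itemID, or NaN if the user has no such preference. */
    float valueFor(long itemID) {
        for (int i = 0; i < itemIDs.length; i++) {
            if (itemIDs[i] == itemID) {
                return values[i];
            }
        }
        return Float.NaN;
    }

    public static void main(String[] args) {
        CompactPrefs prefs = new CompactPrefs(1L, new long[] {10L, 20L}, new float[] {4.0f, 2.5f});
        System.out.println("user " + prefs.getUserID() + " rated item 20 with " + prefs.valueFor(20L));
    }
}
```

With string IDs, every preference would carry at least one extra object reference, which adds up quickly when millions of preferences have to stay in memory.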

Page 4: Introduction to Collaborative Filtering with Apache Mahout

Recommender

■ Recommender is the basic interface for all of Mahout's recommenders

□ recommend n items for a particular user

□ estimate the preference of a user towards an item

■ a CandidateItemsStrategy fetches all items that might be recommended for a particular user

■ a Rescorer allows recommendations to be postprocessed

List<RecommendedItem> topItems = recommender.recommend(1, 10);

float preference = recommender.estimatePreference(1, 25);
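
The interplay of candidate selection, preference estimation and postprocessing can be sketched in plain Java as follows. This is an illustration of the general flow only, not Mahout's code: TopNSketch, the estimate function and the isFiltered predicate are hypothetical stand-ins for the roles played by the recommender, the candidate items and a filtering rescorer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongPredicate;
import java.util.function.LongToDoubleFunction;

final class TopNSketch {
    /**
     * Scores every candidate item, skips filtered items and items without an
     * estimate (NaN), and returns the IDs of the n best-scored items.
     */
    static List<Long> topN(long[] candidateItems, LongToDoubleFunction estimate,
                           LongPredicate isFiltered, int n) {
        List<Long> kept = new ArrayList<>();
        Map<Long, Double> scores = new HashMap<>();
        for (long itemID : candidateItems) {
            if (isFiltered.test(itemID)) {
                continue;  // postprocessing step: drop unwanted items
            }
            double score = estimate.applyAsDouble(itemID);
            if (Double.isNaN(score)) {
                continue;  // no estimate possible for this item
            }
            kept.add(itemID);
            scores.put(itemID, score);
        }
        // Sort by estimated preference, highest first.
        kept.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return new ArrayList<>(kept.subList(0, Math.min(n, kept.size())));
    }

    public static void main(String[] args) {
        long[] candidates = {1L, 2L, 3L, 4L};
        // Toy estimator: the score equals the item ID; item 4 is filtered out.
        System.out.println(topN(candidates, item -> item * 1.0, item -> item == 4L, 2));
    }
}
```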

Page 5: Introduction to Collaborative Filtering with Apache Mahout

Item-Based Collaborative Filtering

■ ItemBasedRecommender

□ can also compute item similarities

□ can provide preferences for items as justification for recommendations

■ lots of similarity measures available (Pearson correlation, Jaccard coefficient, ...)

■ also allows the use of precomputed item similarities stored in a file (via FileItemSimilarity)

ItemBasedRecommender recommender =
    new GenericItemBasedRecommender(dataModel,
        new PearsonCorrelationSimilarity(dataModel));

List<RecommendedItem> similarItems =
    recommender.mostSimilarItems(5, 10);
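
As an illustration of what such a similarity measure computes, the following plain-Java sketch calculates the Pearson correlation of two items over the users who rated both. It does not use Mahout's classes; ItemPearsonSketch and the map-based data layout (userID to rating) are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class ItemPearsonSketch {
    /**
     * Pearson correlation of two items, computed over the users who rated both.
     * Returns NaN if fewer than two users overlap or if one item has no variance.
     */
    static double pearson(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : ratingsA.entrySet()) {
            Double b = ratingsB.get(entry.getKey());
            if (b == null) {
                continue;  // user did not rate the second item
            }
            double a = entry.getValue();
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        Map<Long, Double> itemA = new HashMap<>();
        itemA.put(1L, 1.0);
        itemA.put(2L, 2.0);
        itemA.put(3L, 3.0);
        Map<Long, Double> itemB = new HashMap<>();
        itemB.put(1L, 2.0);
        itemB.put(2L, 4.0);
        itemB.put(3L, 6.0);
        System.out.println("similarity = " + pearson(itemA, itemB));
    }
}
```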

Page 6: Introduction to Collaborative Filtering with Apache Mahout

Latent factor models

■ SVDRecommender

□ uses a decomposition of the user-item interaction matrix to compute recommendations

■ uses a Factorizer to compute a Factorization from a DataModel; several implementations are available

□ Simon Funk's SGD

□ Alternating Least Squares

□ Weighted matrix factorization for implicit feedback data

Factorizer factorizer = new ALSWRFactorizer(dataModel, numFeatures,
    lambda, numIterations);

Recommender svdRecommender =
    new SVDRecommender(dataModel, factorizer);

List<RecommendedItem> topItems = svdRecommender.recommend(1, 10);
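
The idea behind these models can be sketched in a few lines of plain Java: a preference is estimated as the dot product of a user's and an item's factor vectors, and SGD nudges both vectors after each observed rating. The sketch below is written in the spirit of SGD-based factorization and is not Mahout's Factorizer code; FactorSketch, learningRate and lambda are hypothetical names and parameters.

```java
final class FactorSketch {
    /** Estimated preference: dot product of user and item factor vectors. */
    static double predict(double[] userFactors, double[] itemFactors) {
        double sum = 0.0;
        for (int f = 0; f < userFactors.length; f++) {
            sum += userFactors[f] * itemFactors[f];
        }
        return sum;
    }

    /**
     * One regularized SGD step for a single observed rating.
     * Updates both vectors in place and returns the error before the update.
     */
    static double sgdStep(double[] userFactors, double[] itemFactors, double rating,
                          double learningRate, double lambda) {
        double error = rating - predict(userFactors, itemFactors);
        for (int f = 0; f < userFactors.length; f++) {
            double p = userFactors[f];
            double q = itemFactors[f];
            userFactors[f] += learningRate * (error * q - lambda * p);
            itemFactors[f] += learningRate * (error * p - lambda * q);
        }
        return error;
    }

    public static void main(String[] args) {
        double[] user = {0.1, 0.1};
        double[] item = {0.1, 0.1};
        for (int step = 0; step < 5; step++) {
            double error = sgdStep(user, item, 4.0, 0.01, 0.02);
            System.out.println("error before step " + step + ": " + error);
        }
    }
}
```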

Page 7: Introduction to Collaborative Filtering with Apache Mahout

Evaluating recommenders

■ RecommenderEvaluator, RecommenderIRStatsEvaluator

□ measure the prediction quality of a recommender using a random split of the dataset

□ support for MAE, RMSE, Precision, Recall, ...

□ need a DataModel, a RecommenderBuilder, and a DataModelBuilder for the training data

RecommenderEvaluator maeEvaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();

maeEvaluator.evaluate(
    new BiasedRecommenderBuilder(lambda2, lambda3, numIterations),
    new InteractionCutDataModelBuilder(maxPrefsPerUser),
    dataModel, trainingPercentage, 1 - trainingPercentage);
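
For reference, the two rating-prediction metrics named above can be computed as in the following plain-Java sketch. It is independent of Mahout's evaluator classes; ErrorMetricsSketch and its array-based inputs are assumptions made for the example.

```java
final class ErrorMetricsSketch {
    /** Mean absolute error between held-out ratings and their predictions. */
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Root mean squared error; penalizes large deviations more strongly than MAE. */
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};
        double[] predicted = {3.0, 3.0, 3.0};
        System.out.println("MAE  = " + mae(actual, predicted));
        System.out.println("RMSE = " + rmse(actual, predicted));
    }
}
```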

Page 9: Introduction to Collaborative Filtering with Apache Mahout

Starting to work on Mahout

■ Prerequisites

□ Java 6

□ Maven

□ svn client

■ check out the source code from

http://svn.apache.org/repos/asf/mahout/trunk

■ import it as a Maven project into your favorite IDE

Page 10: Introduction to Collaborative Filtering with Apache Mahout

Project: novel item similarity measure

■ in the Million Song Dataset Challenge, a novel item similarity measure was used in the winning solution

■ it would be great to see this one featured in Mahout as well

■ Task

□ implement the novel item similarity measure as an implementation of Mahout's ItemSimilarity interface

■ Future Work

□ this novel similarity measure is asymmetric; ensure that it is applied correctly in all scenarios
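
The slides do not name the measure. One asymmetric measure that fits the description is the asymmetric cosine similarity, sim(i, j) = |U(i) ∩ U(j)| / (|U(i)|^α · |U(j)|^(1−α)), where U(i) is the set of users who interacted with item i. Treating it as the measure meant here is an assumption. The plain-Java sketch below shows the computation and why argument order matters; it does not implement Mahout's ItemSimilarity interface, and AsymmetricCosineSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AsymmetricCosineSketch {
    /**
     * Asymmetric cosine similarity of item i towards item j.
     * alpha = 0.5 yields the ordinary (symmetric) cosine on binary data;
     * other values make similarity(i, j) differ from similarity(j, i).
     */
    static double similarity(Set<Long> usersOfI, Set<Long> usersOfJ, double alpha) {
        if (usersOfI.isEmpty() || usersOfJ.isEmpty()) {
            return 0.0;
        }
        // Iterate over the smaller set when counting common users.
        Set<Long> smaller = usersOfI.size() <= usersOfJ.size() ? usersOfI : usersOfJ;
        Set<Long> larger = smaller == usersOfI ? usersOfJ : usersOfI;
        int common = 0;
        for (Long user : smaller) {
            if (larger.contains(user)) {
                common++;
            }
        }
        return common / (Math.pow(usersOfI.size(), alpha) * Math.pow(usersOfJ.size(), 1.0 - alpha));
    }

    public static void main(String[] args) {
        Set<Long> usersOfI = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> usersOfJ = new HashSet<>(Arrays.asList(3L, 4L));
        System.out.println("sim(i, j) = " + similarity(usersOfI, usersOfJ, 1.0));
        System.out.println("sim(j, i) = " + similarity(usersOfJ, usersOfI, 1.0));
    }
}
```

Any code path that assumes sim(i, j) equals sim(j, i), for example a cache that stores only one value per item pair, would need to be reviewed before such a measure is used.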

Page 11: Introduction to Collaborative Filtering with Apache Mahout

Project: temporal split evaluator

■ currently, Mahout's standard RecommenderEvaluator randomly splits the data into training and test sets

■ for datasets with timestamps, it would be much more interesting to use this temporal information to split the data into training and test sets

■ Task

□ create a TemporalSplitRecommenderEvaluator similar to the existing AbstractDifferenceRecommenderEvaluator

■ Future Work

□ factor out the logic for splitting datasets into training and test sets
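
A temporal split could work roughly as in the following plain-Java sketch: sort the preferences by timestamp, train on the oldest fraction and test on the rest. This is only one possible design, not existing Mahout code; TemporalSplitSketch, TimedPref and the single global cut-off are assumptions (a per-user cut-off would be an equally plausible choice).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TemporalSplitSketch {
    /** Hypothetical preference record with a timestamp. */
    static final class TimedPref {
        final long userID;
        final long itemID;
        final float value;
        final long timestamp;

        TimedPref(long userID, long itemID, float value, long timestamp) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * Splits preferences by time: element 0 of the result holds the oldest
     * trainingFraction of the data (training set), element 1 the newer rest (test set).
     */
    static List<List<TimedPref>> split(List<TimedPref> prefs, double trainingFraction) {
        List<TimedPref> sorted = new ArrayList<>(prefs);
        sorted.sort(Comparator.comparingLong((TimedPref p) -> p.timestamp));
        int cut = (int) Math.floor(sorted.size() * trainingFraction);
        List<List<TimedPref>> result = new ArrayList<>();
        result.add(new ArrayList<>(sorted.subList(0, cut)));
        result.add(new ArrayList<>(sorted.subList(cut, sorted.size())));
        return result;
    }

    public static void main(String[] args) {
        List<TimedPref> prefs = new ArrayList<>();
        prefs.add(new TimedPref(1L, 10L, 4.0f, 40L));
        prefs.add(new TimedPref(1L, 11L, 3.0f, 10L));
        prefs.add(new TimedPref(2L, 10L, 5.0f, 30L));
        prefs.add(new TimedPref(2L, 12L, 2.0f, 20L));
        List<List<TimedPref>> parts = split(prefs, 0.75);
        System.out.println("training: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```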

Page 12: Introduction to Collaborative Filtering with Apache Mahout

Project: baseline method for rating prediction

■ port MyMediaLite's UserItemBaseline to Mahout (a preliminary port is already available)

■ user-item baseline estimation is a simple approach that estimates the global tendency of a user or an item to deviate from the average rating (described in Y. Koren: Factor in the Neighbors: Scalable and Accurate Collaborative Filtering, TKDD 2009)

■ Task

□ polish the code

□ make it work with Mahout's DataModel

■ Future Work

□ create an ItemBasedRecommender that makes use of the estimated biases
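
The baseline estimate has the form b_ui = μ + b_u + b_i, where μ is the global average rating and b_u, b_i are the user and item biases. The plain-Java sketch below estimates the biases in a single pass with the regularized averages b_i = Σ(r_ui − μ) / (λ2 + |R(i)|) and b_u = Σ(r_ui − μ − b_i) / (λ3 + |R(u)|). It is neither MyMediaLite's nor Mahout's code; BaselineSketch and the array-based input are assumptions, and an iterative variant would repeat the two passes several times.

```java
import java.util.HashMap;
import java.util.Map;

final class BaselineSketch {
    private final double globalMean;
    private final Map<Long, Double> itemBias = new HashMap<>();
    private final Map<Long, Double> userBias = new HashMap<>();

    /** users[k], items[k] and ratings[k] together describe the k-th rating. */
    BaselineSketch(long[] users, long[] items, double[] ratings, double lambda2, double lambda3) {
        double sum = 0.0;
        for (double rating : ratings) {
            sum += rating;
        }
        globalMean = sum / ratings.length;

        // Item biases: regularized average deviation from the global mean.
        Map<Long, Double> itemSum = new HashMap<>();
        Map<Long, Integer> itemCount = new HashMap<>();
        for (int k = 0; k < ratings.length; k++) {
            itemSum.merge(items[k], ratings[k] - globalMean, Double::sum);
            itemCount.merge(items[k], 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : itemSum.entrySet()) {
            itemBias.put(e.getKey(), e.getValue() / (lambda2 + itemCount.get(e.getKey())));
        }

        // User biases: regularized average of what mean and item bias leave unexplained.
        Map<Long, Double> userSum = new HashMap<>();
        Map<Long, Integer> userCount = new HashMap<>();
        for (int k = 0; k < ratings.length; k++) {
            double residual = ratings[k] - globalMean - itemBias.get(items[k]);
            userSum.merge(users[k], residual, Double::sum);
            userCount.merge(users[k], 1, Integer::sum);
        }
        for (Map.Entry<Long, Double> e : userSum.entrySet()) {
            userBias.put(e.getKey(), e.getValue() / (lambda3 + userCount.get(e.getKey())));
        }
    }

    /** Baseline estimate; unknown users or items contribute a bias of zero. */
    double estimate(long userID, long itemID) {
        return globalMean + userBias.getOrDefault(userID, 0.0) + itemBias.getOrDefault(itemID, 0.0);
    }

    public static void main(String[] args) {
        BaselineSketch baseline = new BaselineSketch(
            new long[] {1L, 1L, 2L, 2L}, new long[] {10L, 20L, 10L, 20L},
            new double[] {5.0, 3.0, 4.0, 2.0}, 25.0, 10.0);
        System.out.println("estimate(1, 10) = " + baseline.estimate(1L, 10L));
    }
}
```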

Page 13: Introduction to Collaborative Filtering with Apache Mahout

Thank you.

Questions?

Sebastian Schelter

Database Systems and Information Management Group (DIMA), Technische Universität Berlin