
IT 14 065

Degree project, 30 credits
October 2014

Building and Evaluating an Adaptive Real-time Recommender System

Jeff Nkandu

Department of Information Technology


Faculty of Science and Technology, UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract


Most recommender algorithms in use today are slow to adapt to changes in user preferences. This is because they are focused on model building and offline calculation of recommendations. The fact that they require large amounts of information about users before they can make sensible recommendations does not help their case either. This work proposes an adaptive prediction scheme that makes real-time recommendations to users. The scheme was developed by Kristiaan Pelckmans [1]. It is real-time in that it calculates new recommendations every time a user submits some side information. It is adaptive in that it maintains an online memory of user activities which evolves as user preferences change. In this work, the current state of the art in the implementation of recommender systems is investigated. The adaptive prediction scheme is explained in detail. Its applicability in driving a recommender system is evaluated in comparison with other “established” recommender algorithms. Using a movie recommender system implemented with the scheme, it is shown that the scheme relies on much less data in order to make recommendations and that the quality of its recommendations is slightly better than that of the common recommender algorithms based on collaborative filtering. Lastly, the scheme’s limitations are highlighted and recommendations for future work are made.

Printed by: Reprocentralen ITC
IT 14 065
Examiner: Ivan Christoff
Subject reviewer: Thomas Schön
Supervisor: Kristiaan Pelckmans


Contents

1 Introduction
    1.1 Recommender Systems
        1.1.1 Collaborative filtering
        1.1.2 Content-based filtering
        1.1.3 Hybrid filtering
    1.2 Background
    1.3 Proposed Solution
    1.4 Thesis Overview

2 Theoretical Review
    2.1 What is Similarity?
        2.1.1 Cosine Similarity
        2.1.2 Pearson Correlation (PC) Coefficient
    2.2 Recommender Algorithms in Practice
        2.2.1 User-based CF
        2.2.2 Item-based CF
        2.2.3 Slope-One Algorithm
        2.2.4 Latent Factor Models
    2.3 Recommender Systems at Netflix
    2.4 Evaluating Recommender Systems
        2.4.1 RMSE and MAE
        2.4.2 Normalized Discounted Cumulative Gain (NDCG)

3 The Adaptive Real-time Movie Recommender
    3.1 Setup
    3.2 The Adaptive Predictor
        3.2.1 Solution
        3.2.2 The Adaptive Predictor Algorithm
        3.2.3 Tuning the gamma value
    3.3 The Movie Recommender
        3.3.1 Console Application
        3.3.2 The Web Application
    3.4 Data Sets
    3.5 Evaluation

4 Experimental Results
    4.1 Optimal Gamma Value
    4.2 NDCG
    4.3 User Feedback on Recommendations Quality

5 Evaluation and Analysis

6 Conclusion and Future Work
    6.1 Limitations and Future Work
    6.2 Conclusion

Bibliography

A Apache Mahout

B The Movie Recommender Website

List of Figures

2.1 Overall architecture of the Netflix movie recommender system (Source: Netflix, Inc [1])
3.1 Flow chart showing interaction between the user and the movie recommender web application (Use case 1)
3.2 Flow chart showing interaction between the user and the movie recommender web application (Use case 2)
4.1 Example plot of rQ against gamma (γ = 10^x)
4.2 Adaptive predictor vs Item-based and User-based recommenders, using the first 100 ratings from the test set of Dataset 1
4.3 Adaptive predictor vs Item-based and User-based recommenders, using the first 1000 ratings from the test set of Dataset 1
4.4 Chart showing number of movies rated by users who gave high recommendations quality ratings
4.5 Chart showing number of movies rated by users who gave bottom 10 low recommendations quality ratings
A.1 Overall Architecture of the Apache Mahout User-based Collaborative Filtering Engine (Source: Apache Mahout Official Website [2])
B.1 Screen shot of the Login page [index page]
B.2 Screen shot of the user Sign Up page
B.3 Screen shot of the page for presenting movie recommendations to users and getting movie ratings from them
B.4 Screen shot of page for gathering users’ overall quality rating of the adaptive predictor algorithm

List of Tables

1.1 Example of ratings given by 4 users on 5 movies
2.1 Illustration of PC coefficient similarity of users in Table 1.1
2.2 Data format accepted by the Apache Mahout data model
2.3 Example movie recommendations sorted in descending order of predicted ratings
3.1 Summary of quantitative information about data sets used
4.1 Average NDCG values for adaptive predictor, user-based and item-based recommenders using the first 100 ratings from the test set of Dataset 1 and the whole of Dataset 2

List of Algorithms

3.1 The Adaptive Predictor (AP) algorithm

Chapter 1

Introduction

Recommender systems are smart systems designed to predict the interests or preferences of users of a particular system. A typical recommender system is designed to suggest a specific type of item to users (for example, movies, news, friends, holiday destinations). The idea behind recommender systems is to filter out information that the user might find unnecessary and help them in their decision-making process [3]. In this era where users are overloaded with so much information, recommender systems are becoming a cardinal component of any product or service on offer [4]. Recommender systems are found in many software environments nowadays, including online services such as Amazon, Spotify, Facebook, Netflix and eBay, to name a few.

Research in recommender systems is almost as old as the World Wide Web (WWW) per se, and as such different techniques and algorithms in this area have been developed. In fact, most techniques used to implement recommender systems are part of a wider area of study called information filtering, where research has been going on for a long time [5].

This chapter introduces the different types of recommender systems and gives a background of this study, identifying what problems exist in the current approaches used. It then describes the proposed solution and concludes by giving an overview of this thesis.

1.1 Recommender Systems

As most organizations move their businesses to the Internet (wholly or in part), personalized recommendation of products and services has become a cardinal component of their Internet presence. Over the years, a number of algorithms and approaches have been developed to handle the problem of recommending items to users. By far, the most common way in which user preferences are predicted in recommender systems is by estimating ratings of items the user has not yet seen. These estimates are calculated based on ratings that the user has given on other items or ratings given by users with similar interests. A few items with the highest estimated ratings are recommended to the user. Apart from ratings, some contextual information such as time and location may also be used for estimating ratings [6]. Table 1.1, from [3], shows an example of ratings given by users on a set of items (movies).

        Godfather   The Matrix   Shutter Island   Lion King 1   The Notebook
John        5           1              -               2              2
Mary        1           5              2               5              5
Lisa        2           -              3               5              4
Kim         4           3              5               3              -

Table 1.1: Example of ratings given by 4 users on 5 movies (a dash marks an empty cell)

In the example above, the range of possible ratings a user can give to a movie is 0 (minimum, not liked) to 5 (maximum, very liked). Cells with empty ratings mean that the user has not yet seen the associated movie. In its simplest form, the aim of a recommender system is to fill in the empty cells with predicted ratings. This example will be developed further in Chapter 2.

Most techniques used in recommender systems consist of three stages: (1) a large amount of information about users and their behaviors is collected (directly or indirectly); (2) this database is used to build an offline data model [7] that stores the correlation between the user and the item(s) on offer; (3) based on this model, personalized recommendations are made to users the next time they use the service.

According to Adomavicius and Tuzhilin [6], the recommendation problem can formally be formulated as follows: let C be a set of all users in a system and S be a set of all possible items that can be recommended to them. Also let u be a utility function that measures how much user c ∈ C likes item s ∈ S, i.e., u : C × S → R, where R is an ordered set of non-negative integers or real numbers in a predefined range. For each user c, we then look for an item s′ that maximizes the utility function, as in equation 1.1:

$$\forall c \in C, \quad s'_c = \arg\max_{s \in S} u(c, s) \tag{1.1}$$

The utility function u estimates a measurement such as a rating (by far, the most commonly used) that a user may give to each unseen item s, and then item s′ ∈ S with the highest estimated rating is recommended to the user as shown in equation 1.1. Alternatively, the first N items with the highest estimated ratings may be presented to the user as recommendations. For items the user has already seen, u(c, s) is a constant value given by a user, e.g. Mary gave Item X a rating of 3 (out of 5). For items unseen by the user, the utility function may be based on methods from machine learning, approximation theory and other heuristics. Most of these methods aim to optimize a performance criterion such as the mean square error (MSE) [6].
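The selection step in equation 1.1, and its top-N variant, can be sketched in a few lines of Java. This is only an illustration: the class name TopNSketch, the method topN and the example estimates are hypothetical and not taken from the thesis, and the estimated utilities are assumed to have been computed already by some predictor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the thesis): ranks unseen items by estimated utility.
final class TopNSketch {

    // Returns the ids of the n items with the highest estimated utility u(c, s),
    // in descending order of utility. With n = 1 this is the arg max of equation 1.1.
    static List<String> topN(Map<String, Double> estimatedUtility, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(estimatedUtility.entrySet());
        // Highest utility first; ties are broken by item id so the result is deterministic.
        entries.sort((a, b) -> {
            int byUtility = Double.compare(b.getValue(), a.getValue());
            return byUtility != 0 ? byUtility : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up estimates for one user c over three unseen items.
        Map<String, Double> estimates = Map.of("A", 3.2, "B", 4.7, "C", 4.1);
        System.out.println(topN(estimates, 2)); // prints [B, C]
    }
}
```

Sorting all candidates is the simplest choice for a sketch; for a large item set a bounded priority queue would avoid sorting items that can never reach the top N.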

Recommender systems are generally classified into three (3) major categories based on the approach used to predict user preferences [8]. These are as follows:

• Collaborative filtering: a user will be recommended items based on what other people who share a similar taste have liked in the past.

• Content-based filtering: a user will be recommended items based on what they have liked in the past.

• Hybrid filtering: this approach combines aspects of both collaborative and content-based filtering.

Another distinction usually made among recommender systems, which will not be explored further in this text, is between memory-based (or heuristic-based) and model-based recommender systems. Memory-based methods rely on a defined formula to estimate predicted ratings whereas model-based methods depend on a model learned from a data set [6]. Characteristics of each of the approaches listed above are now discussed in detail.

1.1.1 Collaborative filtering

This is by far the most commonly used approach in the implementation of recommender systems [7]. Part of its popularity was fueled by the attention the approach received during the now-discontinued Netflix Prize competition whose explicit purpose was to find the best collaborative filtering algorithm. It was the first time that the research community in recommender systems had access to such a huge dataset (100 million movie ratings) of ratings from real-life users [3]. This encouraged a lot of innovators to come on board to push the boundaries of recommender system algorithms.

Collaborative filtering predicts the utility value a user would give to an item based on item ratings given by other users. Formally, [6] shows that the collaborative filtering problem can be represented as follows: given a user c, let u(c_j, s) be all the ratings assigned to item s by each user c_j, where ∀j : 0 ≤ j ≤ N : c_j ∈ C, C is the set of all users with a similar taste as user c and N is the total number of users in set C. The utility measure u(c, s) of item s by user c can then be predicted as follows:

$$u(c, s) = \operatorname{aggr}_{c_j \in C} \, u(c_j, s) \tag{1.2}$$

In most cases, the aggregation function aggr is defined as a weighted sum adjusted to account for different rating scales through mean centering, as follows:

$$u(c, s) = \bar{u}(c) + \frac{1}{k} \sum_{c_j \in C} \operatorname{sim}(c, c_j) \times \left[ u(c_j, s) - \bar{u}(c_j) \right] \tag{1.3}$$

where k is a normalizing factor and is usually calculated as $k = \sum_{c_j \in C} |\operatorname{sim}(c, c_j)|$.

The value ū(c) is the mean rating of all items rated by user c and ū(c_j) is the mean rating of each user c_j. The sim function can be any similarity measure. The most commonly used are cosine and correlation coefficient similarity measures. These will be discussed in detail in Chapter 2.
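To make equation 1.3 concrete, the following Java sketch computes the mean-centered weighted sum for one user and one item. The class and method names are hypothetical, the numbers in main are made up, and the fallback to the user's own mean when no neighbour carries any weight is an assumption of this sketch rather than something prescribed by [6].

```java
// Hypothetical illustration of equation 1.3; not code from the thesis.
final class MeanCenteredPredictionSketch {

    // userMean            : mean rating of user c
    // sims[j]             : sim(c, c_j) for neighbour c_j
    // neighbourRatings[j] : u(c_j, s), the rating neighbour c_j gave to item s
    // neighbourMeans[j]   : mean rating of neighbour c_j
    static double predict(double userMean, double[] sims,
                          double[] neighbourRatings, double[] neighbourMeans) {
        double weighted = 0.0;
        double k = 0.0; // normalizing factor: sum of |sim(c, c_j)|
        for (int j = 0; j < sims.length; j++) {
            weighted += sims[j] * (neighbourRatings[j] - neighbourMeans[j]);
            k += Math.abs(sims[j]);
        }
        // Assumption of this sketch: with no usable neighbours, return the user's mean.
        return k == 0.0 ? userMean : userMean + weighted / k;
    }

    public static void main(String[] args) {
        // User mean 3.0; two neighbours with similarities 1.0 and -0.5.
        double p = predict(3.0, new double[] {1.0, -0.5},
                new double[] {5.0, 2.0}, new double[] {4.0, 3.0});
        System.out.println(p); // prints 4.0
    }
}
```

In the example, the first neighbour rated the item one point above their own mean and the negatively correlated neighbour rated it one point below theirs, so both push the prediction above the user's mean of 3.0.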


1.1.2 Content-based filtering

Like collaborative filtering, content-based filtering depends on similarities in order to make recommendations. However, the similarity measurement is restricted to the preference history of the user for whom the item recommendation is to be made.

Specifically, content-based recommender systems calculate the similarity between items unseen by the user and those that the user liked in the past, based on their features [3]. The best-matching items are recommended to the user. For example, user X may have given high ratings to a number of books by author Y. The system will in time learn to recommend more books by author Y to user X. The author feature is in this case being used to measure the similarity between items seen by the user and those unseen, and to recommend to the user unseen items that have a higher similarity score. Other book features that may be used include plot, genre, character, form and setting.

Content-based filtering is mostly employed in systems that offer text-based items such as books, news and documents [6]. These systems will normally store an item profile of each item that can be recommended to the user. Naturally, this item profile will consist of all (or most) of the features that characterize the items on offer. For example, a book item profile may have all the book features mentioned earlier. Information about items the user has liked in the past is stored in a user history. The user history will at least contain the rating given by a user on each item they have liked.

Formally, a content-based recommender system may be represented as follows [6]: let ItemProfile(s) be the item profile of item s, i.e., the set of features characterizing item s. Also, let UserHistory(c) be the user history of user c, containing items that the user has liked in the past. The utility function u(c, s) can then be defined as follows:

u(c, s) = sim(UserHistory(c), ItemProfile(s)) (1.4)

The sim function in equation 1.4 can be a similarity function such as the cosine similarity measure or some other heuristics. Apart from the traditional heuristic methods which depend on a formula to calculate the utility prediction, Bayesian classifiers and machine learning techniques such as clustering, artificial neural networks (ANN) and decision trees can also be used in content-based recommendation. These techniques employ a slightly different approach in that they calculate the utility prediction using a model learned from the user and item data.
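As a rough illustration of equation 1.4 with a heuristic sim function, the Java sketch below represents each item profile as a numeric feature vector and scores an unseen item by its cosine similarity to a summary of the user history. Everything here is an assumption made for illustration: the class name, the three features, and in particular the choice to summarize UserHistory(c) as the rating-weighted sum of the profiles of liked items.

```java
// Hypothetical illustration of equation 1.4; the feature encoding is assumed.
final class ContentBasedSketch {

    // Cosine of the angle between two equally long feature vectors; 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed summary of UserHistory(c): rating-weighted sum of the profiles of liked items.
    static double[] userProfile(double[][] likedItemProfiles, double[] ratings) {
        double[] profile = new double[likedItemProfiles[0].length];
        for (int i = 0; i < likedItemProfiles.length; i++) {
            for (int f = 0; f < profile.length; f++) {
                profile[f] += ratings[i] * likedItemProfiles[i][f];
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        // Hypothetical book features: [written by author Y, genre is crime, genre is romance].
        double[][] liked = {{1, 1, 0}, {1, 0, 0}};
        double[] ratings = {5, 4};
        double[] user = userProfile(liked, ratings);
        double[] unseenByAuthorY = {1, 1, 0};
        double[] unseenOther = {0, 0, 1};
        // The unseen book by author Y scores higher than the unrelated one.
        System.out.println(cosine(user, unseenByAuthorY) > cosine(user, unseenOther)); // prints true
    }
}
```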

1.1.3 Hybrid filtering

As the name suggests, hybrid filters combine different techniques to create one recommender. Most large-scale recommender systems are hybrid in nature (for example, the one used by Netflix, which will be discussed in detail in Chapter 2). The advantage of hybrid recommenders is that they use the strengths of various algorithms to create a more powerful system. Some of the ways in which different approaches can be combined are as follows [6]:

• implement a collaborative filter and a content-based filter both acting on the same data set, and combine their predictions

• implement a collaborative filter whose input comes from a content-based filter

• implement a content-based filter whose input comes from a collaborative filter

• implement a general unifying model that has characteristics of both content-based and collaborative filtering.

1.2 Background

Interest in recommender systems has increased over the years for a number of reasons. We will briefly highlight two reasons here. Firstly, the idea of a recommender system itself is inspired by the fact that such a system can help users discover interesting items to consume [4]. This, in turn, leads to user satisfaction and the service provider stands to benefit from an increased value of service as usage increases [9]. Secondly, [4] points out that users today are faced with an information overload as more and more innovative services are coming up on the Internet. Also, established services such as Facebook, eBay, Amazon, YouTube, Netflix and so on are generating so much information that it is becoming increasingly difficult for users to find what they are interested in [4][10]. No wonder that these services are pouring huge amounts of resources into the implementation of recommender systems to filter out information that will not be of interest to the user. One case in point is the much publicized $1 million Netflix Prize competition aimed at developing the best possible collaborative filter for the movie data set [12] of Netflix (an on-demand Internet-based movie streaming firm [11]).

Current approaches to the implementation of recommender systems have their own challenges. Firstly, most recommender systems are not real-time in nature. The calculation of predicted user interests from an existing data set is normally done offline. The next time a user accesses the service, the stored recommendations for that user from the offline calculation are made available to her; these will not be affected by the user’s current interaction with the system. This is a problem because the reality is that user preferences do not always remain the same; they change as the user is exposed to more influences and information, experiences a change in circumstances and so on. Recommender systems should be able to adapt to all these changes [7].

Another challenge is that a recommender system has to store huge quantities of information about a user before it can make accurate predictions about the user’s preferences. As such, a newly implemented recommender system, with little or no information to work with, is bound to be faced with what is referred to as the cold-start problem. This is a scenario where the recommender system is unable to provide accurate recommendations to users because it has little information about them or because it has new items which have not yet been liked by users. This is not an easy problem to circumvent because recommender systems are naturally heavily dependent on information. One solution is to combine recommender techniques, as is the case in hybrid recommender systems. For example, use collaborative filtering to recommend items to new users (taking advantage of the availability of information from other users, some of whom may have some common attributes with the new user) and content-based filtering to recommend to old users who have rated sufficiently many items in the past.

1.3 Proposed Solution

This project examines an essentially different approach to the implementation of recommender systems, based on an adaptive coding scheme (adaptive coder) proposed by Pelckmans in [13]. Pelckmans initially developed this scheme for adaptive data compression. The scheme relates to Solomonoff’s Algorithmic Probability (ALP) which is introduced in [14]. Its suitability for use in adaptive compression was thoroughly investigated in a thesis work by Supan [15]. The current work now extends the usage of the adaptive coder to the area of recommender systems. Using movie recommendation as an example, we explore the applicability of the adaptive coder in driving a recommender system. For the purpose of this study, we will refer to the aforementioned adaptive coder as the adaptive predictor (AP) in the rest of this text.

While this AP scheme was originally proposed in a context of compression, its use in a recommender setting is immediate. The scheme, as used in this study, provides real-time recommendations to users based on relevant side information. The recommendation is done in real-time in that the user preferences are calculated as the user interacts with the system. It is adaptive because recommendations are updated as the user provides more side information. All this helps the proposed scheme to capture any changes in the user’s taste and ensure that the given recommendations are always current from the user’s perspective.

1.4 Thesis Overview

This thesis is organized as follows: Chapter 2 reviews different recommender systems in use today, exploring what approaches and techniques are in use. Chapter 3 explains the adaptive predictor on which this work is focused, using movie recommendation as a running example. Chapter 4 gives the experimental results and Chapter 5 analyzes and discusses these results. Chapter 6 provides the conclusion and highlights any possible future extensions to this work.

Chapter 2

Theoretical Review

The previous chapter introduced the three major types of recommender systems: collaborative, content-based and hybrid systems. This is by far the most common classification of recommender systems. However, other classifications exist. For example, a classification which is of interest to this study groups recommender systems according to whether they are offline or online [7]. As has been understood from the literature reviewed in this study, large-scale implementations of recommender systems are hybrid in nature and have both offline and online components. A classic example, which will also be covered in this chapter, is that by Netflix and is presented in [1].

One concept that is central to the implementation of most recommender systems is the concept of similarity. This chapter starts by exploring similarity as a mathematical concept in the context of recommender systems before delving into the central algorithms, techniques and methods in the implementation of real-life, large-scale recommender systems such as that at Netflix.

2.1 What is Similarity?

According to [16]: “two objects are similar if they are referenced by similar objects”. This definition fits perfectly in the context of recommender systems because their explicit aim is to group objects (users) based on what other “similar” objects (items) they reference, or vice versa. Similarity in recommender systems can refer to different features such as meta-data/tags, user play behavior and user rating behavior [17]. As noted earlier, recommender systems express the degree of similarity between objects as a utility measure calculated using various heuristic functions. The implementation of the heuristic function depends on the problem domain and on whether the recommendation is implemented as a collaborative, content-based or other form of filter. The similarity measure aims to build a neighborhood of like-minded users, or objects with common attributes. The neighborhood building process constitutes the learning process for the recommender system algorithm [18].

We now discuss two (2) commonly used heuristic functions in recommender systems: (1) cosine and (2) correlation coefficient. A number of other similarity measures are used in recommender systems but will not be discussed in this text. These include Spearman correlation, Tanimoto coefficient and Euclidean distance.

2.1.1 Cosine Similarity

In trigonometry, the cosine of an angle is the ratio of the length of the adjacent side to the hypotenuse. The behavior of the cosine is such that it tends towards +1 as the angle between the two sides decreases, whereas it tends towards -1 when the angle increases. The same applies in cosine similarity. Values close to +1 suggest a close similarity whereas those close to -1 can mean there is no similarity at all.

Building from equation 1.3, cosine similarity (sim function) in collaborative filters can be represented as follows: given that S is the set of all items in the data set, let S_xy be the set of all items rated by both user x and user y. Also, let $\vec{x}$ and $\vec{y}$ be two vectors representing users x and y respectively. The cosine similarity between the two user vectors is the cosine of the angle between them, calculated as follows:

$$\operatorname{sim}(x, y) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \times \|\vec{y}\|} = \frac{\sum_{s \in S_{xy}} r_{x,s}\, r_{y,s}}{\sqrt{\sum_{s \in S_{xy}} r_{x,s}^2}\; \sqrt{\sum_{s \in S_{xy}} r_{y,s}^2}} \tag{2.1}$$

where r is the rating provided by a user on an item.

The same cosine formula is used for content-based filters but just as an intermediate step, as discussed in [6].
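Equation 2.1 can be sketched in Java as follows. The class name and the example ratings are hypothetical; each user is assumed to be stored as a map from item id to rating, so that S_xy is simply the set of keys present in both maps. Returning 0 when the users share no items is a convention chosen for this sketch. Note that when all ratings are non-negative, as in Table 1.1, a value computed this way cannot fall below 0.

```java
import java.util.Map;

// Hypothetical illustration of equation 2.1; not code from the thesis.
final class CosineSimilaritySketch {

    // Cosine similarity summed only over S_xy, the items rated by both users.
    static double cosine(Map<String, Double> x, Map<String, Double> y) {
        double dot = 0.0, sumSqX = 0.0, sumSqY = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            Double ry = y.get(e.getKey());
            if (ry == null) {
                continue; // item not rated by y, so it is not in S_xy
            }
            double rx = e.getValue();
            dot += rx * ry;
            sumSqX += rx * rx;
            sumSqY += ry * ry;
        }
        double denom = Math.sqrt(sumSqX) * Math.sqrt(sumSqY);
        return denom == 0.0 ? 0.0 : dot / denom; // 0 when S_xy is empty (sketch convention)
    }

    public static void main(String[] args) {
        // Made-up ratings; item "c" is rated by x only and is therefore ignored.
        Map<String, Double> x = Map.of("a", 4.0, "b", 2.0, "c", 5.0);
        Map<String, Double> y = Map.of("a", 2.0, "b", 1.0);
        // On the shared items the vectors (4, 2) and (2, 1) are parallel,
        // so the result is 1 up to floating-point rounding.
        System.out.println(cosine(x, y));
    }
}
```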

2.1.2 Pearson Correlation (PC) Coefficient

Correlation is simply a linear relationship between two variables. A correlation coefficient is a number that measures the strength of that relationship. The Pearson coefficient, like the cosine similarity measure, has a range of -1 to +1 and the interpretation of the values is the same [19].

Generally, given two data points x and y with covariance Σ, the PC coefficient is computed as follows:

$$\mathrm{PC}(x, y) = \frac{\Sigma(x, y)}{\sigma_x \times \sigma_y} \tag{2.2}$$

where σ is the standard deviation.

PC is the most commonly used coefficient measure in recommender systems. The PC coefficient in a recommendation algorithm is calculated as shown below, using the same variables as in equation 2.1:

$$\operatorname{sim}(x, y) = \frac{\sum_{s \in S_{xy}} (r_{x,s} - \bar{r}_x)(r_{y,s} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{x,s} - \bar{r}_x)^2}\; \sqrt{\sum_{s \in S_{xy}} (r_{y,s} - \bar{r}_y)^2}} \tag{2.3}$$

where r̄ is the mean rating of the user on all items s ∈ S_xy [3][6]. Using ratings from the example in table 1.1, table 2.1 illustrates results from a PC computation using equation 2.3.

         John     Mary     Lisa     Kim
John     1.000   -0.938   -0.839    0.659
Mary    -0.938    1.000    0.922   -0.787
Lisa    -0.839    0.922    1.000   -0.659
Kim      0.659   -0.787   -0.659    1.000

Table 2.1: Illustration of PC coefficient similarity of users in Table 1.1

A number of conclusions can be made from the results in table 2.1. For example, the results suggest that Lisa and Mary share a very similar taste in movies whereas John and Mary may actually not like any movie in common at all.

For a collaborative filter, the results from the similarity measure will normally be used in the aggregation function in equation 1.3 to predict ratings for items unseen by a user.
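The computation behind table 2.1 can be sketched in Java as below. The class is hypothetical and two assumptions are made. First, the positions of the empty cells in Table 1.1 are assumed (John has not rated Shutter Island, Lisa has not rated The Matrix, Kim has not rated The Notebook). Second, each user's mean is taken over all the ratings that user has given, while the sums run over co-rated items only; under these assumptions the sketch reproduces the values of table 2.1 to three decimals, whereas restricting the means to co-rated items would give slightly different values (about -0.962 for John and Mary).

```java
import java.util.Locale;

// Hypothetical illustration of equation 2.3; not code from the thesis.
final class PearsonSimilaritySketch {

    static final double NONE = Double.NaN; // marks an empty cell (movie not rated)

    // Mean of all ratings a user has given, skipping empty cells.
    static double mean(double[] r) {
        double sum = 0.0;
        int n = 0;
        for (double v : r) {
            if (!Double.isNaN(v)) {
                sum += v;
                n++;
            }
        }
        return sum / n;
    }

    // Pearson correlation in the shape of equation 2.3: sums run over co-rated items only.
    // Assumption: each mean is taken over all of that user's ratings.
    static double pearson(double[] x, double[] y) {
        double mx = mean(x), my = mean(y);
        double num = 0.0, dx2 = 0.0, dy2 = 0.0;
        for (int s = 0; s < x.length; s++) {
            if (Double.isNaN(x[s]) || Double.isNaN(y[s])) {
                continue; // item s is not in S_xy
            }
            double dx = x[s] - mx, dy = y[s] - my;
            num += dx * dy;
            dx2 += dx * dx;
            dy2 += dy * dy;
        }
        return num / (Math.sqrt(dx2) * Math.sqrt(dy2));
    }

    public static void main(String[] args) {
        // Ratings from Table 1.1; the positions of the empty cells are assumed.
        double[] john = {5, 1, NONE, 2, 2};
        double[] mary = {1, 5, 2, 5, 5};
        double[] lisa = {2, NONE, 3, 5, 4};
        double[] kim = {4, 3, 5, 3, NONE};
        System.out.println(String.format(Locale.ROOT, "%.3f %.3f %.3f",
                pearson(john, mary), pearson(mary, lisa), pearson(john, kim)));
        // prints -0.938 0.922 0.659
    }
}
```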

2.2 Recommender Algorithms in Practice

We have seen that similarity-based recommender systems measure how “alike” objects are so as to generate recommendations. These objects may either be users or items. This notion forms a further distinction of recommender systems. In fact, collaborative filtering (CF) in recommender systems can either be implemented as user-based or item-based. In this section, we discuss these two types of CF-based recommender system implementations, before looking at a linear recommendation scheme, i.e., the Slope-One algorithm.

Another classification of a collaborative filtering recommender is whether it is a neighborhood method or a latent factor model. Generally, all similarity-based recommender algorithms are neighborhood methods. Latent factor models represent an alternative approach that transforms users and items to the same latent space [3]. This section will give a general overview of these methods.

We also introduce recommender system implementations using Apache™ Mahout® (or just Apache Mahout), a Java-based scalable open-source data mining and machine learning library built on top of a distributed computing framework called Apache™ Hadoop®. Both of these libraries are used by a number of large organizations such as Facebook, Netflix, Twitter and LinkedIn to solve various machine learning problems. Refer to Appendix A for a further discussion of Apache Mahout.

Column 1    Column 2    Column 3    Column 4
User ID     Item ID     Rating      Time Stamp

Table 2.2: Data format accepted by the Apache Mahout data model

2.2.1 User-based CF

User-based recommender systems predict a user rating on a new item based on ratings given by users with a similar taste. This is the type of recommendation that was referred to when collaborative filtering (CF) was discussed in subsection 1.1.1. In fact, it is the traditional approach to the implementation of CF recommender systems.

To recommend items to the current user c, the first step in user-based recommenders is to calculate the similarity between all users in the system using a similarity measure, such as Pearson correlation (equation 2.3). Next, the result is fed into a neighborhood function, such as thresholding or k-Nearest Neighbors (kNN) [20], that picks the top k users who exhibit the highest similarity to user c. These are then used in an aggregation function, most likely equation 1.3, to predict ratings for user c on all items s ∈ S′, where |S′| ≤ |S|, S is the set of all items in the system and S′ denotes the set of items rated by all users in the neighborhood of user c. Item set S′ is then sorted according to the predicted ratings, starting with the highest. Finally, the top N items are presented to user c as recommendations.
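The first three steps described above (similarity, neighborhood, aggregation) can be sketched in plain Java without any library. The class below is hypothetical and makes several assumptions: ratings are held in a small dense matrix with NaN for empty cells, the positions of the empty cells in Table 1.1 are assumed, user means are taken over all of a user's ratings, and the neighborhood is a simple k-nearest selection with k = 2. It is meant to show how the pieces fit together, not how Apache Mahout implements them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical end-to-end illustration of user-based CF; not code from the thesis.
final class UserBasedSketch {

    static final double NONE = Double.NaN; // empty cell

    static double mean(double[] r) {
        double sum = 0.0;
        int n = 0;
        for (double v : r) {
            if (!Double.isNaN(v)) { sum += v; n++; }
        }
        return n == 0 ? 0.0 : sum / n;
    }

    // Step 1: Pearson correlation over co-rated items (means over all of a user's ratings).
    static double pearson(double[] x, double[] y) {
        double mx = mean(x), my = mean(y), num = 0.0, dx2 = 0.0, dy2 = 0.0;
        for (int s = 0; s < x.length; s++) {
            if (Double.isNaN(x[s]) || Double.isNaN(y[s])) continue;
            num += (x[s] - mx) * (y[s] - my);
            dx2 += (x[s] - mx) * (x[s] - mx);
            dy2 += (y[s] - my) * (y[s] - my);
        }
        double denom = Math.sqrt(dx2) * Math.sqrt(dy2);
        return denom == 0.0 ? 0.0 : num / denom;
    }

    // Step 2: indices of the k users most similar to user c (excluding c itself).
    static List<Integer> nearest(double[][] ratings, int c, int k) {
        List<Integer> others = new ArrayList<>();
        for (int j = 0; j < ratings.length; j++) {
            if (j != c) others.add(j);
        }
        others.sort((a, b) -> Double.compare(pearson(ratings[c], ratings[b]),
                                             pearson(ratings[c], ratings[a])));
        return new ArrayList<>(others.subList(0, Math.min(k, others.size())));
    }

    // Step 3: equation 1.3, using only neighbours that have rated item s.
    static double predict(double[][] ratings, int c, int s, List<Integer> neighbours) {
        double weighted = 0.0, norm = 0.0;
        for (int j : neighbours) {
            if (Double.isNaN(ratings[j][s])) continue;
            double sim = pearson(ratings[c], ratings[j]);
            weighted += sim * (ratings[j][s] - mean(ratings[j]));
            norm += Math.abs(sim);
        }
        double own = mean(ratings[c]);
        return norm == 0.0 ? own : own + weighted / norm;
    }

    public static void main(String[] args) {
        // Rows: John, Mary, Lisa, Kim (Table 1.1, empty-cell positions assumed).
        double[][] ratings = {
            {5, 1, NONE, 2, 2},
            {1, 5, 2, 5, 5},
            {2, NONE, 3, 5, 4},
            {4, 3, 5, 3, NONE}
        };
        int lisa = 2, theMatrix = 1;
        List<Integer> hood = nearest(ratings, lisa, 2);
        // Neighbourhood is [1, 3] (Mary, Kim); the predicted rating is roughly 4.63.
        System.out.println(hood + " " + predict(ratings, lisa, theMatrix, hood));
    }
}
```

Repeating the prediction for every item the user has not rated and sorting the results in descending order would give the final top-N list.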

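In Apache Mahout, the user-based recommendations can be calculated as shown in Listing 2.1.

Listing 2.1: User-based recommendation in Apache Mahout

// Classes come from the org.apache.mahout.cf.taste packages; File is java.io.File.
// Load the ratings from a CSV file in the format of table 2.2.
DataModel model = new FileDataModel(new File("dataset.csv"));
// Measure user-to-user similarity with Pearson correlation.
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Keep the neighborhoodSize most similar users as the neighborhood.
UserNeighborhood neighborhood = new NearestNUserNeighborhood(neighborhoodSize, similarity, model);
// Combine model, neighborhood and similarity into a user-based recommender.
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);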

Listing 2.1 shows all the steps discussed in this subsection. The DataModel object is built from data stored in a comma-separated values (CSV) file but it may also come from a database. The data model in Apache Mahout accepts data stored in the format shown in table 2.2. Pearson correlation has been used here as a similarity measure but Apache Mahout also implements other similarity measures such as cosine.

The challenge faced by user-based recommendation is that user similarities change often. In a large system that processes thousands of users per day, there is very little chance that the calculated user neighborhood will remain the same even within a short period. This makes user-based recommender systems hard to scale [21]. The training has to be done often, if not in real-time. On the other hand, real-time recommendation in user-based recommenders becomes impractical as the data set grows since all the stored ratings have to be used in order to calculate the neighborhood.

2.2.2 Item-based CF

Item-based recommenders calculate recommendations based on ratings given to similar items[3]. The steps taken in calculating the recommendations are essentially the same as in subsection 2.2.1. The similarity measures in user-based recommendation also apply here, except the similarity compares the rating values given by many users on one item, rather than by one user on many items[19].

As listing 2.2 shows, no neighborhood is calculated in item-based recommendation. This is because the recommendation process begins with an already limited number of items, i.e. the items already rated by the user for whom the recommendations are to be calculated. This is, in a sense, already a neighborhood, and it would be pointless to calculate another one.

Listing 2.2: Item-based recommendation in Apache Mahout

    // Load the ratings from a CSV file
    DataModel model = new FileDataModel(new File("dataset.csv"));
    // Similarity measure between items
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Item-based recommender: no neighborhood is needed
    Recommender recommender = new GenericItemBasedRecommender(model, similarity);

The fact that item-based recommenders start with a smaller data set than user-based ones means that they are generally faster. Moreover, the number of items on offer does not change often.

2.2.3 Slope-One Algorithm

Slope-One is one of the most basic recommender system algorithms. It is attractive to use because it is fast and works well for real-time recommendation [21]. It is a variant of item-based recommendation.

The name ’Slope-One’ comes from the fact that the scheme has the form f(x) = x + b, where x represents a rating value and b is a constant. Formally, [22] shows that a user’s predicted ratings using the Slope-One scheme can be calculated as follows: given a user c and an item s, let $u(c_j, s)$ be the rating given to item s by user $c_j$, where $c_j \in C$ for all $1 \le j \le N$, C is the set of all users who have rated item s and N is the total number of users in set C. The predicted rating $\hat{u}(c, s)$ of item s by user c can then be calculated as follows:

\[ \hat{u}(c, s) = \bar{u}(c) + \frac{1}{N} \sum_{c_j \in C} \left[ u(c_j, s) - \bar{u}(c_j) \right] \tag{2.4} \]

where $\bar{u}(c)$ is the mean rating of all items rated by user c and $\bar{u}(c_j)$ is the mean rating of each user $c_j$. In essence, the scheme predicts a user’s rating on an item based on the user’s average rating plus the average deviation from the mean for the item across all the users that have rated it.
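The prediction rule in equation 2.4 can be written down in a few lines. The following is a minimal sketch of that rule only; the class and method names are hypothetical, ratings are held in plain in-memory maps, and a user with no ratings is assumed to have a mean of 0.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the prediction rule in equation 2.4.
final class MeanDeviationPredictor {

    /** Mean of all ratings given by one user (the barred terms in equation 2.4). */
    static double meanRating(Map<String, Double> userRatings) {
        return userRatings.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0); // assumption: users without ratings have mean 0
    }

    /**
     * Predicted rating of `item` for the target user: the target's own mean
     * plus the average deviation from their own mean of every user who rated the item.
     */
    static double predict(Map<String, Double> target, List<Map<String, Double>> others, String item) {
        double deviationSum = 0.0;
        int raters = 0; // N in equation 2.4
        for (Map<String, Double> other : others) {
            Double rating = other.get(item);
            if (rating != null) {
                deviationSum += rating - meanRating(other);
                raters++;
            }
        }
        double base = meanRating(target);
        return raters == 0 ? base : base + deviationSum / raters;
    }

    public static void main(String[] args) {
        Map<String, Double> c = Map.of("A", 4.0, "B", 2.0);  // mean 3.0
        Map<String, Double> c1 = Map.of("A", 5.0, "S", 3.0); // mean 4.0, deviation on S is -1.0
        Map<String, Double> c2 = Map.of("B", 1.0, "S", 4.0); // mean 2.5, deviation on S is +1.5
        // 3.0 + (-1.0 + 1.5) / 2 = 3.25
        System.out.println(predict(c, List.of(c1, c2), "S"));
    }
}
```

Because the prediction needs only per-user means and per-item deviations, both of which can be updated incrementally, the rule is cheap enough for online use.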

Apache Mahout originally had a Slope-One algorithm implementation, but it was removed in version 0.8 (the current version is 0.9) due to lack of usage. While usage of the Slope-One algorithm has reduced over the years due to its low accuracy (high root mean square error (RMSE)), works such as [22] have shown that it can be a good candidate recommender algorithm for online recommenders and, in general, recommenders that are more concerned about performance.

2.2.4 Latent Factor Models

Latent factor models project the user-item interaction onto a lower-dimensional feature space[23]. The aim is to explain the more hidden details in the user-item interaction; to reveal interesting aspects of user ratings beyond their numerical value[3]. Examples of latent factor models used in recommender systems include Latent Dirichlet Allocation[24], neural networks[25] and matrix factorization (or singular value decomposition (SVD)).

While latent factor models are well established in the area of information retrieval, their usage in recommender systems is relatively new. Traditional recommender systems, such as the ones discussed so far, rely on explicit feedback such as ratings or likes. Latent factor models, on the other hand, can calculate recommendations based on implicit feedback such as purchase history, search patterns, or mouse clicks.

In Apache Mahout, SVD recommendations can be calculated as shown in Listing 2.3.

Listing 2.3: Matrix factorization recommendation in Apache Mahout

    // Factorize the rating matrix into numFeatures latent features
    Factorizer factorizer = new SVDPlusPlusFactorizer(dataModel, numFeatures, numIterations);
    Recommender recommender = new SVDRecommender(dataModel, factorizer);
    // Top 10 recommendations for the given user
    List<RecommendedItem> topItems = recommender.recommend(userID, 10);

Though latent factor models were slow to gain a foothold among recommender algorithms, they have been shown to provide better accuracy than the much more popular neighborhood methods. [26], [3] and [27] give a detailed discussion of latent factor models as used in recommender systems.


2.3 Recommender Systems at Netflix

Netflix, the Internet-based movie streaming firm, is one of the pioneers of the current research and interest in recommender systems.

Starting from 2006, they ran a prize competition aimed at improving the performance of their movie recommendation algorithms. The last edition of the competition was run in 2009 and the winning solution is documented in [27]. The performance measurement criterion used was RMSE. The winning solution had an RMSE that was 10% better than Netflix’s own algorithm, on a Netflix data set consisting of 100 million movie ratings collected from 1999 to 2006. The solution was a hybrid of more than 100 different recommender algorithms, from both latent factor models and neighborhood methods. It turns out that only a few aspects of the final solution were incorporated into Netflix’s recommender system. Netflix report that their offline tests on the winning solution showed high computational costs and low scalability (the solution was designed on 100 million ratings whereas Netflix now has a data set with more than 5 billion ratings). One lesson drawn from the competition is that a well-structured hybrid recommender algorithm can provide better accuracy than any individual algorithm.

Today, movie recommendation at Netflix is much more complex than just a simple hybrid of individual algorithms. The current architecture is best summed up in [1]. As figure 2.1 shows, it is a multi-layered architecture with different services (Apache Hadoop, MySQL, algorithm service etc.) running at each layer. The layers are online, nearline and offline.

User input is collected by the Event Distribution service and goes beyond mere movie ratings; it includes other user behavior and temporal information such as what movie a user plays, time of day, search patterns and so on.

The computation of recommendations is done at each layer. As such, each layer can independently produce recommendations to users.

The final recommendations presented to the user are a combination of results from all three layers. This ensures that the recommendations are fresh, reflecting the current user preference pattern (online computation), but also utilize information from the user’s recent actions (nearline computation) and knowledge gained from all the data stored about the user as well as other users (offline computation).

2.4 Evaluating Recommender Systems

Evaluation of recommender systems traditionally focuses on prediction accuracy[28]. Thus, a recommender system that calculates predicted ratings will be evaluated based on how close the predictions are to the actual ratings given by users. Performance evaluation methods used in this regard include root mean square error (RMSE), mean absolute error (MAE), precision, recall and F-measure. These methods do not take into account the order in which recommended items are accessed by the user.


Figure 2.1: Overall architecture of the Netflix movie recommender system (Source: Netflix, Inc [1])


Position   Movie Recommendation   Predicted Rating
1          The Matrix             5
2          Shutter Island         4.5
3          Lion King              4.5
4          The Godfather          4
5          Memento                3

Table 2.3: Example movie recommendations sorted in descending order of predicted ratings

Consider table 2.3. The table shows computed movie recommendations from a recommender system. The recommendations are presented to the user in descending order of predicted ratings (ratings are in the range [0,5]). If the recommender system is to be evaluated using, say, RMSE, the prediction error will be the same regardless of whether the user decides to give an actual rating of 3 to either Shutter Island or Lion King, since they both have the same predicted rating (4.5).
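As a quick check of this point, the squared-error contribution of such a rating can be worked out directly. The notation $\hat{u}$ for the predicted rating and $u$ for the actual rating is assumed here for illustration:

```latex
\[
(\hat{u} - u)^2 = (4.5 - 3)^2 = 2.25
\]
```

The value is identical whichever of the two movies receives the rating of 3, so an accuracy-based measure cannot tell the two rankings apart.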

In some systems, the order in which a user accesses recommended items is important. Examples of such systems include movie, news and article recommendation websites as well as search engines. In such systems, rank-based evaluation techniques may be used. Examples include normalized discounted cumulative gain (NDCG), normalized distance-based performance measure (NDPM) and Spearman’s rank correlation.

We will now discuss two commonly used accuracy-based evaluation techniques (RMSE and MAE) and one rank-based evaluation technique (NDCG). NDCG will later be used in evaluating the adaptive predictor proposed in this project.

2.4.1 RMSE and MAE

RMSE is probably the most popular method used to evaluate recommender systems[28]. It measures the prediction error of a recommender algorithm, i.e. how much the predicted ratings deviate from the actual user ratings on recommended items. RMSE tests on an algorithm are normally carried out offline on an existing data set which is divided into training and test sets (e.g. 80% training set, 20% test set) [20]. The recommender algorithm is trained on the training set and its RMSE is then calculated using the test set.

Formally, the RMSE of a recommender algorithm can be represented as follows: let S be a test set with actual user-item ratings $u(c, s)$ for each user-item pair (c, s). Also let $\hat{u}(c, s)$ be the predicted rating for the user-item pair, calculated by the trained recommender algorithm. The RMSE of the algorithm can be computed as shown in equation 2.5

\[ \mathrm{RMSE} = \sqrt{\frac{1}{|S|} \sum_{(c,s) \in S} \left[ \hat{u}(c, s) - u(c, s) \right]^2} \tag{2.5} \]

and the MAE is calculated as in equation 2.6.

\[ \mathrm{MAE} = \frac{1}{|S|} \sum_{(c,s) \in S} \left| \hat{u}(c, s) - u(c, s) \right| \tag{2.6} \]

Since both of these techniques are error measures, the aim is to get values that are as small as possible.
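Equations 2.5 and 2.6 translate directly into code. The sketch below assumes that the predicted and actual ratings of the test set are supplied as two parallel arrays; the class and method names are hypothetical.

```java
// Hypothetical sketch of equations 2.5 (RMSE) and 2.6 (MAE).
final class ErrorMetrics {

    /** Root mean square error between predicted and actual ratings. */
    static double rmse(double[] predicted, double[] actual) {
        checkInput(predicted, actual);
        double sumSquared = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquared += diff * diff;
        }
        return Math.sqrt(sumSquared / actual.length);
    }

    /** Mean absolute error between predicted and actual ratings. */
    static double mae(double[] predicted, double[] actual) {
        checkInput(predicted, actual);
        double sumAbsolute = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sumAbsolute += Math.abs(predicted[i] - actual[i]);
        }
        return sumAbsolute / actual.length;
    }

    private static void checkInput(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 2.0, 5.0, 1.0};
        double[] actual = {3.0, 3.0, 2.0, 4.0}; // errors: 1, -1, 3, -3
        System.out.println("RMSE = " + rmse(predicted, actual)); // sqrt(20 / 4) = sqrt(5)
        System.out.println("MAE  = " + mae(predicted, actual));  // 8 / 4 = 2.0
    }
}
```

In this example the RMSE is larger than the MAE because squaring gives more weight to the two large errors.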

2.4.2 Normalized Discounted Cumulative Gain (NDCG)

Discounted Cumulative Gain (DCG) is a rank-based evaluation technique used to evaluate the relevance of items in an ordered recommendation list[29][30]. With roots in information retrieval, this method is most popular for evaluating web search results, which in a way can also be classified as recommendations.

DCG works on two assumptions. These are:

1. Highly relevant items are more useful than less relevant ones

2. Relevant items with lower ranked positions are less useful to the user since they are less likely to be examined.

Given a ranked recommendation list, the DCG accumulated up to position p can be computed as in equation 2.7[31][30].

\[ \mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_b(1 + i)} \tag{2.7} \]

The variable $rel_i$ is the relevance value assigned by the user to the item at position i. It corresponds to a preference value such as a rating. The numerator $2^{rel_i} - 1$ is the “gain” of the recommended item at position $i \le p$ and is discounted logarithmically. The logarithmic base b can be any value in the range [2, 10]. Base 2 is the most commonly used. To account for the fact that recommendation lists may vary in length, the DCG value is usually normalized. The normalized DCG (NDCG) is calculated as shown in equation 2.8.

\[ \mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} \tag{2.8} \]

NDCG is in the range [0,1]. Unlike prediction accuracy-based evaluation techniques, where the aim is to minimize the result, rank-based techniques aim to maximize it. As such, NDCG values close to 1 are better than those close to 0. $\mathrm{IDCG}_p$ in equation 2.8 represents the ideal DCG at position p, i.e. the DCG obtained when the items are sorted in decreasing order of relevance. In a perfect recommendation, $\mathrm{DCG}_p$ at each position will be equal to the corresponding $\mathrm{IDCG}_p$. As such, $\mathrm{NDCG}_p$ will be equal to 1.
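A small sketch of equations 2.7 and 2.8 is given below. It assumes that the relevance values (for example, the ratings the user actually gave) are supplied in the order in which the items were recommended, and that the ideal ordering is obtained by sorting those same values in decreasing order. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of equations 2.7 (DCG) and 2.8 (NDCG).
final class NdcgSketch {

    /** DCG at position p with logarithm base b; rel[i - 1] is the relevance at rank i. */
    static double dcg(double[] rel, int p, double base) {
        double sum = 0.0;
        int limit = Math.min(p, rel.length);
        for (int i = 1; i <= limit; i++) {
            double gain = Math.pow(2.0, rel[i - 1]) - 1.0;
            double discount = Math.log(1.0 + i) / Math.log(base); // log_b(1 + i)
            sum += gain / discount;
        }
        return sum;
    }

    /** NDCG at position p: DCG divided by the DCG of the ideal (descending) ordering. */
    static double ndcg(double[] rel, int p, double base) {
        double[] ideal = rel.clone();
        Arrays.sort(ideal); // ascending order
        for (int lo = 0, hi = ideal.length - 1; lo < hi; lo++, hi--) {
            double tmp = ideal[lo]; // reverse in place to get descending order
            ideal[lo] = ideal[hi];
            ideal[hi] = tmp;
        }
        double idcg = dcg(ideal, p, base);
        return idcg == 0.0 ? 0.0 : dcg(rel, p, base) / idcg; // assumption: 0 when nothing is relevant
    }

    public static void main(String[] args) {
        double[] ratingsInRankOrder = {3.0, 5.0, 4.0}; // the best item was ranked second
        System.out.println(ndcg(ratingsInRankOrder, 3, 2.0));
    }
}
```

A list that is already sorted by decreasing relevance yields an NDCG of 1, and any other ordering of items with distinct relevance values yields a smaller value.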


Chapter 3

The Adaptive Real-time Movie Recommender

This project explores a novel approach to the implementation of recommender systems. This chapter presents this approach in detail. Using movie recommendation as an example, the adaptive predictor originally presented in [13] is set up to provide movie recommendations to users based on their past interests and the interests of other users.

This chapter begins with a formal presentation of the adaptive predictor in the context of movie recommendation. It then explains the implementation details of the experimental system designed to investigate the approach.

3.1 Setup

Let the set of all movies on offer in a movie recommender system be represented as follows:

\[ \Sigma = \{\sigma_0, \sigma_1, ..., \sigma_m\} \tag{3.1} \]

where m + 1 is the total number of symbols and the symbol $\sigma_0$ is reserved for “not a movie”. Let all information related to the t-th user be represented as $u_t$, which is an element in an appropriate domain D. A sequence of users can be represented as follows:

\[ (u_1, u_2, u_3, ..., u_n) \tag{3.2} \]

Besides this ’input signal’, there is an ’output signal’ consisting of the items clicked on by the respective users. That is, a sequence

\[ (\sigma_1, \sigma_2, \sigma_3, ..., \sigma_n) \]

such that user $u_t$ clicked on the recommended item $\sigma_t \in \Sigma$. Note that, if a user $u_t$ did not follow any recommendations at all, the corresponding outcome would be $\sigma_t = \sigma_0$. It is also inherent in the choice of notation that to each user we associate exactly one item.


3.2 The Adaptive Predictor

A key idea in the implementation of the adaptive predictor algorithm is to represent the items recommended to a user $u_t$ not as a finite set, but as a ‘probability vector’ $p(u_t)$ over $\Sigma$, such that $\sum_{i=1}^{m} p_i(u_t) = 1$. Also $0 \le p_i(u_t) \le 1$ for all i = 1, ..., m and $u_t \in D$. The following interpretation can be given to such probability vectors:

$p_i(u_t)$ = the probability that user $u_t$ clicks on item $\sigma_i$ if presented.

The constraint that the vector has to sum up to 1 implies that we assume that all m items should be presented. The items for the user $u_t$ are ordered by decreasing $p_i(u_t)$ for i = 1, ..., m. In practice, only the first few elements of this sorted list are recommended to the user. The aim of our recommender system is then to choose the function $p : D \to \mathbb{R}^m$ such that $p_i(u_t)$ is in a certain sense optimal. In the remainder of this document, we will describe in which precise sense we need to optimize p, and how to solve the resulting optimization problem effectively.

3.2.1 Solution

A mathematical solution to this problem goes as follows: we maintain a matrix $W \in \mathbb{R}^{m \times D}$ which scores a probability vector $p \in [0, 1]^m$ for given side information $u \in D$ as follows:

\[ p_i(u) = \frac{\exp(W_i u)}{\sum_{j=1}^{m} \exp(W_j u)} \tag{3.3} \]

where $W_i$ denotes the i-th row of the matrix W, taking values in D. Note that $W_i u$ then denotes the inner product between two elements of D. This rule gives a convenient mapping from side information u onto a probability vector p(u), which in turn translates into a recommendation (see discussion above) by sorting the entries of p(u). Then, we look for a $W^*$ which solves the following optimization problem:

\[ W^* = \arg\min_W \; -\sum_{t=1}^{n} \ln\left(p_{i_t}(u_t)\right) \tag{3.4} \]

where $i_t \in \{0, ..., m\}$ is the index corresponding to the item $\sigma_t$ actually clicked on at time t. The reason for using this specific logarithmic loss can be thought of intuitively as follows: the loss incurred when a user clicks on an item $\sigma_i$ that was predicted with probability $p_i$ is proportional to $-\ln(p_i)$, so a click that was predicted with high probability costs little.

Now, the question of how to find a W which solves the optimization problem as well as possible is addressed. The following recursion will be used:

\[ W_t = W_{t-1} + \gamma \left( e_{i_t} - p_{t-1}(u_t) \right) u_t^T \tag{3.5} \]

where $W_0 = 0$, and for all i = 1, ..., m one has

\[ p_{i,t-1}(u_t) = \frac{\exp(W_{i,t-1} u_t)}{\sum_{j=1}^{m} \exp(W_{j,t-1} u_t)} \tag{3.6} \]

The parameter gamma ($\gamma > 0$) has to be tuned appropriately. Here, $e_{i_t} \in \{0, 1\}^m$ corresponds to the unit vector in $\mathbb{R}^m$ where all entries equal zero, except the entry corresponding to the symbol $\sigma_t$. This scheme is effectively implementing a gradient descent method, and is shown to find the optimal matrix W for the task at hand, provided that $\gamma > 0$ is well tuned. The next two subsections give a summary of the proposed adaptive predictor algorithm and how to tune the gamma value, respectively.
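The link between recursion 3.5 and gradient descent can be made explicit. The short derivation below is a sketch, assuming that the loss at step t is the log loss $\ell_t(W) = -\ln p_{i_t}(u_t)$ with p as in equation 3.3; the symbols $\ell_t$ and $\delta_{i,i_t}$ (which is 1 if $i = i_t$ and 0 otherwise) are introduced here for illustration only.

```latex
\begin{align*}
\ell_t(W) &= -\ln p_{i_t}(u_t)
           = -W_{i_t} u_t + \ln \sum_{j=1}^{m} \exp(W_j u_t) \\
\frac{\partial \ell_t}{\partial W_i} &= -\left(\delta_{i, i_t} - p_i(u_t)\right) u_t^T
  \quad\Longrightarrow\quad
  \nabla_W \ell_t = -\left(e_{i_t} - p(u_t)\right) u_t^T \\
W_t &= W_{t-1} - \gamma \, \nabla_W \ell_t(W_{t-1})
     = W_{t-1} + \gamma \left(e_{i_t} - p_{t-1}(u_t)\right) u_t^T
\end{align*}
```

Each update is therefore one gradient step on the loss of the most recent click, with γ acting as the step size.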

3.2.2 The Adaptive Predictor Algorithm

Given a set of m movies, the steps taken to calculate recommendations using the AP scheme are summarized in Algorithm 3.1.

Algorithm 3.1 The Adaptive Predictor (AP) algorithm

    for t = 0 ... ∞ do
        Compute user side information u
        Compute probability vector p (as in equation 3.6)
        Sort items based on vector p
        Recommend item(s) with highest probability to user
        Collect user feedback (e.g. movie rating)
        Update user side information
        Update matrix W_t (as in equation 3.5)
    end for

The variables u and p are m × 1 vectors, with each vector element corresponding to an item in the movie set. These variables are computed each time t that a user accesses the system, and they are specific to each user. On the other hand, W is a global m × m matrix that acts as the “memory” of the system and is updated every time any user updates their side information. It is available to all users and influences which movies end up being recommended to them.
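One iteration of Algorithm 3.1 can be sketched with plain arrays as follows. This is an illustration under stated assumptions, not the implementation used in this work: the class and method names are hypothetical, the matrix W is kept in memory rather than in a database, and the largest score is subtracted before exponentiation, which leaves the probabilities of equation 3.6 unchanged but avoids numerical overflow.

```java
import java.util.Arrays;

// Hypothetical in-memory sketch of the adaptive predictor (equations 3.5 and 3.6).
final class AdaptivePredictorSketch {
    private final double[][] w; // global matrix W, initialised to zero (W_0 = 0)
    private final double gamma; // step size, assumed to be tuned beforehand

    AdaptivePredictorSketch(int numItems, int sideInfoDim, double gamma) {
        this.w = new double[numItems][sideInfoDim];
        this.gamma = gamma;
    }

    /** Equation 3.6: probability vector p for side information u. */
    double[] probabilities(double[] u) {
        double[] p = new double[w.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < w.length; i++) {
            for (int k = 0; k < u.length; k++) {
                p[i] += w[i][k] * u[k]; // inner product W_i u
            }
            max = Math.max(max, p[i]);
        }
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.exp(p[i] - max); // shifted for numerical stability
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    /** Item indices sorted by decreasing probability; the first few are recommended. */
    Integer[] ranking(double[] u) {
        double[] p = probabilities(u);
        Integer[] order = new Integer[p.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(p[b], p[a]));
        return order;
    }

    /** Equation 3.5: W <- W + gamma * (e - p) * u^T, where e marks the clicked item. */
    void update(double[] u, int clicked) {
        double[] p = probabilities(u);
        for (int i = 0; i < w.length; i++) {
            double error = (i == clicked ? 1.0 : 0.0) - p[i];
            for (int k = 0; k < u.length; k++) {
                w[i][k] += gamma * error * u[k];
            }
        }
    }

    public static void main(String[] args) {
        AdaptivePredictorSketch ap = new AdaptivePredictorSketch(3, 3, Math.pow(10, -1.4));
        double[] u = {1.0, 0.0, 0.0}; // the user gave item 0 the maximum rating
        ap.update(u, 2);              // the same user then chose item 2
        System.out.println(Arrays.toString(ap.ranking(u))); // item 2 now ranks first
    }
}
```

Because W is shared by all users, a single update shifts the ranking for every later user whose side information overlaps with u, which is what allows the scheme to adapt without a separate training phase.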

3.2.3 Tuning the gamma value

Given that $r_i$ is the rating given to item $i \in \{0, ..., m\}$ by a user and $p_i$ is the probability associated with the item by the adaptive predictor, the optimal value of γ is the value at which $r_Q$, computed using equation 3.7, attains its maximum. The optimal γ can be found by plotting values of $r_Q$ against values of γ on a logarithmic scale. In the resulting graph, pick the γ value at a maximum point. Figure 4.1 illustrates what the plot will look like.

\[ r_Q = \frac{1}{N} \sum_{i=1}^{N} \ln(r_i p_i) \tag{3.7} \]

The value N is the total number of items.

3.3 The Movie Recommender

To test the adaptive predictor on its ability to drive a recommender system, a movie recommender application was developed. The application was primarily modeled after Algorithm 3.1. It was developed using the Java programming language and relied heavily on the Apache Mahout library¹ for math operations and recommendation-related tasks. The whole project was managed using Apache Maven[32].

The application was first developed with a console interface. Later, a web interface was added to it to create a web-based movie recommender that was then tested on real users. The website was hosted on a free cloud service on the World Wide Web (refer to Appendix B for details). We now discuss these two versions of the application.

3.3.1 Console Application

The console application was used to carry out all the offline tests. The adaptive predictor algorithm can be divided into two (2) tasks: (1) compute recommendations and (2) update user side information and global matrix W. Code listings 3.1 and 3.2 show how these tasks were implemented in the movie recommender application.

The sorted recommendation list returned in Listing 3.1 contains all movie IDs in the system sorted in decreasing order of corresponding probabilities. These movie IDs are used to retrieve the associated movie information (title, year, synopsis, trailer), which is presented as movie recommendations to the current user, in order. Once a user rates one movie, the code in Listing 3.2 is invoked to update the user’s information and the global matrix W. After that, the process starts again. The constant MAX_RATING is the chosen maximum rating for items in the system. In the movie recommender application, that value was 5. Thus ratings ranged from 0 (lowest) to 5 (highest).

3.3.2 The Web Application

The main aim of the movie recommender web application was to test the adaptive predictor in a real recommender system environment on actual users. This was built on top of the console application and used the following web technologies: HTML/CSS/JQuery/JSP for the client side and Java Servlets for the server side. The servlets invoked classes from the console application. User information, as well as the collected feedback, was stored in a MySQL database.

¹ Apache Mahout was introduced in section 2.2

Listing 3.1: Compute user side information, probability vector, sorted recommendation list

    /*
     * Compute user side information (retrieve from DB)
     */
    uVector.assign(RetrieveDao.retrieveUserInfo(userID));

    /*
     * Compute probability vector
     */
    Matrix tempVector = W.times(uVector); // W*u
    for (int i = 0; i < numOfItems; i++) {
        pVector.set(i, 0, Math.exp(tempVector.get(i, 0))); // exponent(W*u)
    }
    pVector = pVector.divide(pVector.zSum()); // exponent(W*u) / sumAll(exponent(W*u))

    /*
     * Prepare recommendations
     * (Assign all values in vector p to recommendation vector r and
     * sort vector r with corresponding item IDs)
     */
    rVector.assign(pVector);
    sortedRecommendationList.addAll(insertionSortrVector(rVector, itemIDs));
    return sortedRecommendationList;

The web application collected two (2) kinds of user feedback. These are:

1. ratings of recommended movies, and

2. rating of the overall quality of the recommendations provided by the system

As such the system had two (2) use-cases:

1. Use case 1: Movie rating

2. Use case 2: System quality rating

These are illustrated in Figure 3.1 and Figure 3.2 respectively. The user is asked to rate at least 15 movies. The list of movies presented to the user is updated using the adaptive predictor every time they rate a movie. This is because the adaptive predictor scheme is designed to compute recommendations online and as such requires updating every time user feedback is received.


Figure 3.1: Flow chart showing interaction between the user and the movie recommender web application (Use case 1)


Figure 3.2: Flow chart showing interaction between the user and the movie recommender web application (Use case 2)


Listing 3.2: Update user side information and global matrix W

    double gamma = Math.pow(10, -1.4); // This is the calculated optimal value of gamma

    /*
     * Update user side information (vector uVector and user profile in the database)
     */
    uVector.set(itemIDs.indexOf(currentMovie), 0, (double) rating / updater.MAX_RATING);
    StoreDao.storeUserInfo(userID, itemIDs.get(currentMovieIndex), rating);

    /*
     * Update matrix W
     */
    fVector.set(itemIDs.indexOf(currentMovie), 0, (double) 1);
    tempVector = fVector.minus(pVector); // feedback vector (e) - probability vector (p)
    tempVector = tempVector.times(gamma); // gamma*(e-p)
    W = W.plus(tempVector.times(uVector.transpose())); // W + gamma*(e-p)*transpose(u)
    StoreDao.storeMatrixW(W); // Store matrix W in database before execution ends
    return W; // Return updated W to compute new recommendations

           Dataset 1   Dataset 2
Movies     1682        167
Users      943         62
Ratings    100,000     1080

Table 3.1: Summary of quantitative information about data sets used

3.4 Data Sets

While a number of small, artificially created data sets were used to test and fine-tune the movie recommender during the implementation process, only two data sets were used for all the test results presented in this report. These are summarized in Table 3.1.

Dataset 1 is a publicly available data set from MovieLens[33]. Dataset 2 is the rating list collected from the online experiment using the movie recommender web application developed for this project.

Each data set is a list with the following column attributes: User ID, Movie ID, Rating, Time stamp. This format is consistent with that accepted by the Apache Mahout data model.


3.5 Evaluation

Evaluation of the adaptive predictor algorithm was done using NDCG. The decision to use a rank-based performance evaluation scheme was influenced by the fact that the algorithm does not predict ratings. Also, there is no direct correlation between the probability associated with an item (as calculated by the adaptive predictor²) and the actual rating given to the item by the user.

² Refer to Section 3.2


Chapter 4

Experimental Results

4.1 Optimal Gamma Value

In this experiment, three different subsets of Dataset 1 were used to determine the optimal value of gamma (γ > 0), as outlined in Section 3.2.3. The average optimal value of γ obtained from the 3 tests is approximately $10^{-1.4}$. Figure 4.1 shows a plot using one subset of Dataset 1 with 275 movie ratings.

The optimal value of γ is obtained by determining the x-coordinate of the maximum point on the graph and raising 10 to that power.

4.2 NDCG

NDCG was the main evaluation technique used in this work. The adaptive predictor was evaluated offline against two other algorithms: user-based and item-based recommenders. Both Dataset 1 and Dataset 2 were used in the evaluation.

The two recommenders, user-based and item-based, required training in order to give recommendations. Fortunately, the makers of Dataset 1 (MovieLens[33]) also made an 80%/20% split of the data set into training and test sets respectively. Thus, the NDCG values of the two algorithms were calculated by first training the algorithms using Dataset 1’s training set (80%) and then using the test set (20%) to compute the values.

The adaptive predictor did not require any training. As such, only the test set was used to calculate its NDCG values from Dataset 1. Figure 4.2 shows a plot of the test results. Similar results could not be obtained using Dataset 2 because, as shown in Table 4.1, both the user-based and item-based recommenders gave 0 for NDCG values. This is because Dataset 2 was too small for the collaborative filtering-based recommender systems to generate any recommendations.

As discussed in Sub-section 2.4.2, NDCG values are always in the range [0,1]. Values close to 1 indicate better performance than those close to 0.


Figure 4.1: Example plot of $r_Q$ against gamma ($\gamma = 10^x$)

Figure 4.2: Adaptive predictor vs Item-based and User-based recommenders, using the first 100 ratings from the test set of Dataset 1


Figure 4.3: Adaptive predictor vs Item-based and User-based recommenders, using the first 1000 ratings from the test set of Dataset 1

Algorithm                 Dataset 1   Dataset 2
Adaptive Predictor        0.1101      0.3042
User-based recommender    0.0958      0
Item-based recommender    0.0778      0

Table 4.1: Average NDCG values for adaptive predictor, user-based and item-based recommenders using the first 100 ratings from the test set of Dataset 1 and the whole of Dataset 2.


Figure 4.4: Chart showing number of movies rated by users who gave high recommendations quality ratings

4.3 User Feedback on Recommendations Quality

In the movie recommender web application, in addition to rating individual movies, the logged-in user was presented with a top 10 movie recommendation list and asked to rate the quality of the recommendations. In essence, they were rating how closely they thought the adaptive predictor captured their movie ’taste’. Of the 62 users that registered on the web application, 39 gave feedback on their perceived quality of the recommendations, representing a 63% response rate.

The assumption in this component of the study is that users that rated more movies should give higher quality ratings. This is because, intuitively, one would think that a good recommender system should get better at knowing what the user likes as it acquires more information from the user. On the other hand, users that have provided little or no information are not expected to rate highly the quality of the recommendations they receive. Figures 4.4 and 4.5 show the actual results obtained.


Figure 4.5: Chart showing number of movies rated by users who gave the bottom 10 (low) recommendations quality ratings


Chapter 5

Evaluation and Analysis

From the results presented, a number of observations were made. We discuss these here.

Generally, the tests show results that are good from a logical point of view, indicating that the adaptive predictor algorithm is suitable for use in driving a recommender system. Firstly, offline test runs using the console application and Dataset 1 indicated that those movies which were rated more often received higher probabilities than those that were not. Also, the probability value associated with a movie “decayed” the longer the movie stayed without being rated. This behavior is as expected since, as mentioned in Section 3.2, the vector p is designed in such a way that all its elements sum up to 1, i.e. $\sum_{i=1}^{m} p_i(u_t) = 1$. An increase in the probability value of one element results in a decrease in the value of other element(s). Secondly, it was noted that a movie that received a maximum rating at time t − 1 normally ranked highly in the sorted recommendation list in the next iteration at time t. This is important because the user at t − 1 might not be the same user for which the recommendations are computed at t. If the users are different at the two times, then the final recommendation list for the user at t is influenced by the movie rating by the user at t − 1 (through the global matrix W) as well as the user’s own ratings from their previous interaction with the recommender. Of course, the value of gamma (γ) also influenced the “decay” of probability values. Higher γ values resulted in faster “decay” whereas lower values made it slow. It was therefore important to compute an optimal γ value, which in the test runs decayed the probabilities in a steady manner.

One important conclusion drawn from the movie recommender web application experiment is that the quality of the items in the system matters; a good recommender system should still be backed by a good item set. A web application may be backed by the best possible recommender system, but if the items it has on offer do not accommodate most user preferences or tastes, then the quality of the underlying recommender system will not matter.

In this work, the movie recommender web application had a movie list with 327 movies. However, only 167 were actually rated by users even though 1080 ratings were collected, as shown in Section 3.4. One explanation is that the unrated movies were of poor quality to most users. But it may also be that their probabilities decayed to values that were too small, so that they kept being pushed to the tail-end of the sorted recommendation list. If the latter explanation is the case, then that is a disadvantage of the adaptive predictor: items getting stuck at the bottom of the sorted recommendation list because of probability values that are too small. This problem would probably not exist if other user side information was used (in addition to ratings).

The results from the NDCG evaluation in Section 4.2 indicate that the adaptive predictor performed slightly better than both the user-based and the item-based recommenders. Its closest rival is the user-based recommender which, for about 65% of the ratings on the graph in Figure 4.2, closely matched the NDCG values of the adaptive predictor. The item-based recommender is not far behind either. If anything, it looks to be the most consistent of the three algorithms.

In Table 4.1, both user-based and item-based recommenders could not giveany NDCG values using Dataset 2 because it was too small even for training.With Dataset 1 even after training the two algorithms with 80% of the data set(80,000 ratings), they still had worse average NDCG values than the adaptivepredictor. Moreover, the adaptive predictor only saw the first 100 ratings fromthe test set. This highlights an advantage with the proposed adaptive predictionscheme: it relies on much less data and as such it is able to deal well with thecold-start problem1 associated with most recommender system algorithms.

Finally, the assumption in Section 4.3 that users with more movie ratingsshould give higher recommendations quality ratings was proven wrong as can bededuced from Figures 4.4 and 4.5. While the user who rated the least numberof movies comes from among the users who gave the lowest quality ratings, theuser who rated the most number of movies also comes from there. Moreover,the average number of movies rated by both sets of users is about the same(around 17, excluding the user that rated more than a 100 movies in the secondChart, Figure 4.5). This may suggest that the items in the system did not caterfor most users’ preferences. It may also be that the system did not run longenough to learn more about the users (the experiment was conducted over 3.5days).

¹ Refer to Section 1.2 for the definition.

Chapter 6

Conclusion and Future Work

6.1 Limitations and Future Work

This thesis work implemented a functional recommender system using a novel approach. However, further improvements on both the scheme and its implementation are needed.

The investigation was done on a small data set. As such, for future work, the scalability of the algorithm in handling large data sets is worth exploring. One clear bottleneck in the algorithm is that it performs a number of matrix multiplications at each iteration. This is unlikely to scale as the item list of the recommender system grows. Future work can explore how to implement the algorithm on a distributed cluster of computers using distributed computing technology, such as Apache Hadoop, so as to speed up the computations and meet the real-time constraints.

An important concept related to the adaptive predictor that has not been investigated in detail in this work is that of “user side information”. The only information that was used is ratings. However, other user information such as gender, age, mood, time of day, weekend/weekday, company (whether they are with friends or not) and so on may influence what movies users might be interested in.
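To illustrate what richer side information could look like in practice, the sketch below encodes a few of the attributes listed above as a fixed-length numeric vector. This is purely hypothetical: the function name `encode_side_info`, the category list and the scaling choices are assumptions made for illustration, and how the adaptive predictor would actually consume such a vector is left open by this work.

```python
from datetime import datetime

# Hypothetical category list, chosen only for this illustration.
GENDERS = ("female", "male", "other")


def encode_side_info(gender, age, when, with_friends):
    """Encode a user's context as a list of floats in [0, 1].

    Layout (an assumption, not taken from the thesis):
    [gender one-hot (3), age, hour of day, is_weekend, with_friends]
    """
    one_hot = [1.0 if gender == g else 0.0 for g in GENDERS]
    age_scaled = min(max(age, 0), 100) / 100.0   # clamp to 0..100 years
    hour_scaled = when.hour / 23.0               # 0.0 at midnight, 1.0 at 23:00
    is_weekend = 1.0 if when.weekday() >= 5 else 0.0  # Saturday=5, Sunday=6
    company = 1.0 if with_friends else 0.0
    return one_hot + [age_scaled, hour_scaled, is_weekend, company]
```

Attributes such as mood are omitted here because they have no obvious numeric source; they would need an explicit user input or some proxy signal.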

One thing that researchers in recommender systems can agree on is the fact that user behavior is complex and dynamic. A user may not like an item today but that is no guarantee that they will not like the same item on another day. Neither is there a guarantee that if they gave a high rating on an item today, they will give it the same rating another time. Therefore, it would also be interesting to investigate the effect on quality of recommendations when users are recommended the same item multiple times and are allowed to rate items more than once.

The movie list used in the web application was constructed using top movie titles from popular movie websites. In total, 327 movies were used. This number of movies is insufficient to capture all user movie preferences, which affected the recommendation quality ratings from users. A larger, well-selected movie list would probably have produced different results.

Lastly, while the neighborhood methods are still enjoying a lot of popularity, trends in recommender systems implementation are shifting towards usage of latent factor models. Thus, it would be worthwhile to evaluate the adaptive predictor’s performance in comparison to recommender systems that are based on these models. This, however, will require a significant change in the setup of the adaptive predictor scheme.

6.2 Conclusion

Using a movie recommender system, this work has shown that the adaptive predictor is capable of driving a real-time recommender system. The performance evaluation of the algorithm showed the adaptive predictor to be slightly better than the common neighborhood-based recommender algorithms, while relying on much less data. This work has also shown that large-scale, real-life recommender systems in use today are hybrid in nature and their structure is complex, with multiple layers. The most popular recommender algorithms are neighborhood-based collaborative filters, though there has been growing interest in latent factor models, such as singular value decomposition (SVD), since the Netflix Prize competition, which was aimed at finding the best movie recommender algorithm for Netflix, the internet-based movie streaming service.

Bibliography

[1] X. Amatriain, “Machine Learning and Recommender Systems at Netflix,” in QCon International Software Development Conference. Netflix, 2013.

[2] “Apache Mahout,” 2014, [Date accessed June 4, 2014]. [Online]. Available: http://mahout.apache.org/

[3] P. Kantor, L. Rokach, F. Ricci, and B. Shapira, Recommender Systems Handbook. Springer Science+Business Media, LLC, 2011. [Online]. Available: http://cds.cern.ch/record/1412605

[4] X. Amatriain, “The Science and the Magic of User Feedback for Recommender Systems,” Telefonica, Bay Area, San Francisco, Tech. Rep., 2011.

[5] U. Hanani, B. Shapira, and P. Shoval, “Information filtering: Overview of issues, research and systems,” User Modeling and User-Adapted Interaction, vol. 11, no. 3, 2001. [Online]. Available: http://dl.acm.org/citation.cfm?id=598363

[6] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, Jun. 2005. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1423975

[7] B. Chandramouli and J. Levandoski, “StreamRec: a real-time recommender system,” SIGMOD 2011, June 12-16, 2011, Athens, Greece, pp. 6–8, 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1989465

[8] L. He and F. Wu, “A time-context-based collaborative filtering algorithm,” Granular Computing, 2009, GRC’09. IEEE . . . , pp. 209–213, Aug. 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5255130

[9] H. Byström, “Movie Recommendations from User Ratings,” Stanford University, Tech. Rep., 2013.

[10] N. N. Liu, L. He, and M. Zhao, “Social temporal collaborative ranking for context aware movie recommendation,” ACM Transactions on Intelligent Systems and Technology, vol. 4, no. 1, pp. 1–26, Jan. 2013. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2414425.2414440

[11] D. Carr, “Netflix Inc.” 2013, [Date accessed May 2, 2014]. [Online]. Available: http://topics.nytimes.com/top/news/business/companies/netflix-inc/index.html

[12] I. Pilászy, “Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata Categories and Subject Descriptors,” Proceedings of the Third ACM Conference on Recommender Systems, pp. 93–100.

[13] K. Pelckmans, “An adaptive compression algorithm in a deterministic world,” Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, 2013. [Online]. Available: http://link.springer.com/chapter/10.1007/978-3-642-44958-1_23

[14] R. J. Solomonoff, “The Discovery of Algorithmic Probability,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 73–88, Aug. 1997. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0022000097915002

[15] M. Supan, “A Predictive Coder: Theory and Practice,” Master Thesis, Uppsala University, 2012.

[16] G. Jeh and J. Widom, “SimRank: a measure of structural-context similarity,” Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1–11, 2002. [Online]. Available: http://dl.acm.org/citation.cfm?id=775126

[17] X. Amatriain, “Machine Learning Algorithms at Netflix,” in MLconf - The Machine Learning Conference, Netflix. Netflix, 2013.

[18] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering,” Proceedings of the fifth international conference on computer and information technology, vol. 1, 2002. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.4.6985&rep=rep1&type=pdf

[19] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout In Action. Shelter Island, NY 11964: Manning Publications Co., 2011. [Online]. Available: https://sisis.rz.htw-berlin.de/inh2011/12399459.pdf

[20] O. Zaïane, “Introduction to data mining,” WP Co, 1999. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.28.2540

[21] D. Musto, Cataldo Ph, “Apache Mahout - Tutorial,” Dipartimento di Informatica - Università degli Studi di Bari, Bari, Italy, Tech. Rep., 2014.

[22] D. Lemire and A. Maclachlan, “Slope One Predictors for Online Rating-Based Collaborative Filtering.” SDM, 2005. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9781611972757.43

[23] S. Schelter and S. Owen, “Collaborative filtering with Apache Mahout,” Proc. of ACM RecSys Challenge, 2012. [Online]. Available: http://www.researchgate.net/publication/235899480_Collaborative_Filtering_with_Apache_Mahout/file/9fcfd513f3719617d7.pdf

[24] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet Allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=944937

[25] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted Boltzmann machines for collaborative filtering,” Proceedings of the 24th international conference on Machine learning - ICML ’07, pp. 791–798, 2007. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1273496.1273596

[26] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, pp. 30–37, 2009. [Online]. Available: https://datajobs.com/data-science-repo/Recommender-Systems-Netflix.pdf

[27] R. Bell, Y. Koren, and C. Volinsky, “The Bellkor 2008 Solution to the Netflix Prize,” Statistics Research Department at ATT, no. 12, pp. 1–21, 2008. [Online]. Available: http://signallake.com/innovation/Bellkor2008.pdf

[28] G. Shani and A. Gunawardana, “Evaluating recommendation systems,” Recommender systems handbook, pp. 1–41, 2011. [Online]. Available: http://link.springer.com/chapter/10.1007/978-0-387-85820-3_8

[29] M. Weimer and A. Karatzoglou, “Maximum Margin Matrix Factorization for Collaborative Ranking,” Advances in neural information processing systems, pp. 1–8, 2007. [Online]. Available: https://papers.nips.cc/paper/3359-cofi-rank-maximum-margin-matrix-factorization-for-collaborative-ranking.pdf

[30] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002. [Online]. Available: http://dl.acm.org/citation.cfm?id=582418

[31] Kaggle, “Normalized Discounted Cumulative Gain,” 2013, [Date accessed June 3, 2014]. [Online]. Available: https://www.kaggle.com/wiki/NormalizedDiscountedCumulativeGain

[32] F. Miller, A. Vandome, and J. McBrewster, “Apache Maven,” 2010, [Date accessed June 3, 2014]. [Online]. Available: http://maven.apache.org/

[33] “MovieLens Data Sets,” 2006, [Date accessed June 4, 2014]. [Online]. Available: http://grouplens.org/datasets/movielens/

Appendix A

Apache Mahout

Apache Mahout (or just Mahout) is a Java-based scalable data mining and machine learning library built on top of Apache™ Hadoop®, a distributed computing platform. Mahout is an open source project maintained by a growing team of volunteers. In its current implementation, Mahout mainly supports three use cases[2][21]: (1) recommendation (only collaborative filtering), (2) clustering and (3) classification. Within these use cases, Mahout implements a number of machine learning algorithms. Each algorithm implementation is single-machine, distributed or both. If the implementation is distributed, its computations can be run on a cluster of computers using Apache Hadoop. Some of the implemented algorithms are listed below:

Collaborative Filtering

• User-Based Collaborative Filtering (Single machine)

• Item-Based Collaborative Filtering (Single machine/Distributed)

• Matrix Factorization with Alternating Least Squares (Single machine/Distributed)

• Matrix Factorization with Alternating Least Squares on Implicit Feedback (Single machine/Distributed)

• Weighted Matrix Factorization, SVD++, Parallel SGD (Single machine)

Classification

• Logistic Regression - trained via SGD (Single machine)

• Naive Bayes/Complementary Naive Bayes (Distributed)

• Random Forest (Distributed)

• Hidden Markov Models (Single machine)

• Multilayer Perceptron (Single machine)

Clustering

• k-Means Clustering (Single machine/Distributed)

• Fuzzy k-Means (Single machine/Distributed)

• Streaming k-Means (Single machine/Distributed)

• Spectral Clustering (Distributed)

In this thesis work, we were specifically interested in the collaborative filtering algorithms in Mahout. Mahout has five top-level packages which interact to compute recommendations. The packages are listed below:

1. DataModel

2. UserSimilarity

3. ItemSimilarity

4. UserNeighborhood

5. Recommender

The packages are written both as interfaces (from which someone can create their own custom recommendation engine) and as actual implementations of the interfaces. All the interfaces of the top-level packages are implemented in org.apache.mahout.cf.taste.impl.

Figure A.1 illustrates the overall architecture of the user-based recommendation engine in Mahout. The same architecture applies to item-based recommendation except there is no neighborhood component (the reason for this was explained in Section 2.2.2). Recommender is the core interface for producing recommendations in Mahout, and all recommender algorithms implemented in Mahout implement it. These include GenericUserBasedRecommender, GenericItemBasedRecommender and SVDRecommender.

Figure A.1: Overall Architecture of the Apache Mahout User-based Collaborative Filtering Engine (Source: Apache Mahout Official Website[2])
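To make the component flow in Figure A.1 concrete, the following is a minimal plain-Python sketch of the four roles in a user-based engine: data model, user similarity, user neighborhood and recommender. It is not Mahout's API. The function names, the toy ratings and the choice of Pearson correlation with a similarity-weighted average are assumptions made for illustration only.

```python
import math

# DataModel role: user -> {item: rating}. Toy data invented for this sketch.
data_model = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob":   {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def user_similarity(model, u, v):
    """UserSimilarity role: Pearson correlation over co-rated items."""
    common = sorted(set(model[u]) & set(model[v]))
    if len(common) < 2:
        return 0.0
    xs = [model[u][i] for i in common]
    ys = [model[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                     sum((y - my) ** 2 for y in ys))
    return cov / norm if norm > 0 else 0.0


def user_neighborhood(model, user, n):
    """UserNeighborhood role: up to n most similar users (positive similarity only)."""
    scored = [(user_similarity(model, user, other), other)
              for other in model if other != user]
    return sorted((s, o) for s, o in scored if s > 0)[::-1][:n]


def recommend(model, user, n_neighbors=2, how_many=3):
    """Recommender role: similarity-weighted average rating of unseen items."""
    totals, weights = {}, {}
    for sim, other in user_neighborhood(model, user, n_neighbors):
        for item, rating in model[other].items():
            if item not in model[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    estimates = {item: totals[item] / weights[item] for item in totals}
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:how_many]
```

An item-based variant would drop the `user_neighborhood` step and score candidate items directly from item-to-item similarities, which mirrors the architectural difference noted above.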

Appendix B

The Movie Recommender Website

The web-based movie recommender is hosted at the following URL: [http://thesisapp.herokuapp.com/, online as of 2014-06-04]. All the source code has been published to GitHub [https://github.com/KundaJ/web-movie-rec.git]. The following figures show screen shots of the web pages.

Figure B.1: Screen shot of the Login page [index page]

Figure B.2: Screen shot of the user Sign Up page

Figure B.3: Screen shot of the page for presenting movie recommendations to users and getting movie ratings from them

Figure B.4: Screen shot of the page for gathering users’ overall quality rating of the adaptive predictor algorithm
