Pranab Ghosh
About me
• Started with numerical computation on mainframes, followed by many years of C and C++ systems and real-time programming, then many years of Java, JEE and enterprise apps
• Worked for Oracle, HP, Yahoo, Motorola, and many startups and mid-size companies
• Currently a Big Data consultant using Hadoop and other cloud-related technologies
• Interested in distributed computation, Big Data, NoSQL databases and data mining
August 11th 2011 Meetup
Hadoop
• The power of functional programming and parallel processing join hands in Hadoop
• At its core, a parallel processing framework running on a cluster of commodity machines
• Stateless, functional-style processing: handling one row of data does not depend on any other row or on shared state
• Divide-and-conquer parallelism: the data gets partitioned, and each partition gets processed by a separate mapper or reducer task
More About Hadoop
• Data locality, at least for the mapper: code gets shipped to where the data partition resides
• Data is replicated, partitioned and stored in the Hadoop Distributed File System (HDFS)
• Mapper output: {k -> v}. Reducer input: {k -> List(v)}. Reducer output: {k -> v}
• Many-to-many shuffle between mapper output and reducer input, with a lot of network IO
• A simple paradigm, but it surprisingly solves an incredible array of problems
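The {k -> v} flow above can be simulated in a few lines of plain Python (a sketch of the paradigm, not actual Hadoop code; the function names are mine):

```python
from collections import defaultdict

def mapper(row):
    """Emit (k, v) pairs for one input row; no shared state."""
    for word in row.split():
        yield word, 1

def reducer(key, values):
    """Receives all values for one key after the shuffle; emits (k, v)."""
    return key, sum(values)

def run(rows):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for k, v in mapper(row):
            grouped[k].append(v)
    return dict(reducer(k, vs) for k, vs in grouped.items())
```

In real Hadoop the grouping step is done by the framework's shuffle across the cluster; here it is just an in-memory dict.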
Recommendation Engine
• Needs no introduction: you have seen one if you have visited Amazon or Netflix. We love it when they get it right, hate it otherwise.
• Very computationally intensive, and hence ideal for Hadoop processing.
• In memory-based recommendation engines, the entire data set is used directly, e.g., collaborative filtering or content-based recommendation.
• In model-based recommendation, a model is built first by training on the data, and then predictions are made from it, e.g., Bayesian or clustering approaches.
Content Based Recommendation
• A memory-based system, based purely on the attributes of the item
• An item with p attributes is treated as a point in a p-dimensional space
• Uses a nearest-neighbor approach: similar items are found by measuring distance in the p-dimensional space
• Useful for addressing the cold start problem, i.e., when a new item is introduced into the inventory
• Computationally intensive, and therefore not very useful for real-time recommendation
Model Based Recommendation
• Based on a traditional machine learning approach
• In contrast to memory-based algorithms, creates a learning model using the ratings as training data
• The model is built offline as a batch process and saved. The model needs to be rebuilt when a significant change in the data is detected.
• Once the trained model is available, making a recommendation is quick; effective for real-time recommendation
Collaborative Filtering
• In a collaborative-filtering recommendation engine, recommendations are based not only on the user's own ratings but also on other users' ratings of the same item and some other items; hence the name collaborative filtering.
• Requires social data, i.e., a user's interest level in an item. It could be explicit, e.g., a product rating, or implicit, based on the user's interaction and behavior on a site.
• A more appropriate name might be "user intent based recommendation engine."
• There are two approaches: in user-based CF, similar users are found first; in item-based CF, similar items are found first.
Item Based or User Based?
• Item-based CF is generally preferred. The similarity relationship between items is relatively static and stable, because items naturally map into genres.
• User-based CF is less preferred, because we humans are more complex than a laptop or a smartphone (although some marketing folks may disagree). As we grow and go through life experiences, our interests change. Our similarity relationships with other humans, in terms of common interests, are more dynamic and change over time.
Utility Matrix
• A matrix of users and items. Each cell contains a value indicating the user's interest level in that item, e.g., a rating. The matrix is sparse.
• The purpose of the recommendation engine is to predict values for the empty cells based on the available cell values.
• The denser the matrix, the better the quality of the recommendations; in practice, though, the matrix is sparse.
• If I have rated item A and I need a recommendation, enough other users must have rated A as well as other items.
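Since the matrix is sparse, it is typically stored as something like a nested map rather than a full 2-D array. A minimal Python sketch (the values and the helper function are illustrative, not from the slides):

```python
# Sparse utility matrix as a nested dict: user -> {item -> rating}.
# Missing cells are simply absent keys. (Hypothetical sample data.)
utility = {
    "u1": {"i2": 4, "i4": 3, "i5": 5},
    "u2": {"i1": 2, "i2": 5, "i5": 4},
    "u3": {"i2": 3, "i4": 4},
    "u4": {"i3": 1, "i5": 2},
}

def density(matrix, num_items):
    """Fraction of cells that are filled; low density means sparse."""
    filled = sum(len(row) for row in matrix.values())
    return filled / (len(matrix) * num_items)
```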
Example Utility Matrix
User\Item   i1     i2     i3     i4     i5
u1                 r12           r14    r15
u2          r21    r22                  r25
u3                 r32           r34
u4                        r43           r45
Rating Prediction Example
• Say we want to predict r35, i.e., the rating of item i5 by user u3.
• Item-based CF: r35 = (c52 x r32 + c54 x r34) / (c52 + c54), where items i2 and i4 are similar to i5
• User-based CF: r35 = (c31 x r15 + c32 x r25) / (c31 + c32), where users u1 and u2 are similar to u3
• cij = similarity coefficient between items (or users) i and j; rij = rating of item j by user i
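The weighted-average prediction above is easy to sketch in Python (the neighbor similarities and ratings below are made-up numbers, just to show the arithmetic):

```python
def predict_rating(neighbors):
    """neighbors: list of (similarity, rating) pairs for the items
    (or users) similar to the one whose rating we want to predict.
    Returns the similarity-weighted average rating."""
    num = sum(c * r for c, r in neighbors)
    den = sum(c for c, _ in neighbors)
    return num / den

# Item-based prediction of r35: i2 and i4 are the neighbors of i5.
# Hypothetical values for (c52, r32) and (c54, r34):
r35 = predict_rating([(0.8, 4), (0.6, 3)])  # (0.8*4 + 0.6*3) / (0.8 + 0.6)
```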
Rating Estimation
• On the previous slide, we assumed rating data for each (item, user) pair was already available through some rating mechanism, a.k.a. explicit rating.
• However, a site may not have a product rating feature at all.
• Even if the rating feature is there, many users may not use it. Even when many users rate, explicit ratings tend to be biased.
• We need a way to estimate ratings based on user behavior on the site and some heuristics, a.k.a. implicit rating.
Heuristics for Rating: An Example
User Activity                                  Rating
1-2 product views                              1
3-5 product views                              2
More than 5 product views                      3
Item put in shopping cart, then abandoned      4
Item added to wish list                        4
Item bought                                    5
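The table above can be encoded as a small function. This is one possible reading of the heuristic; the precedence among signals (purchase beats cart, cart beats views) is my assumption, not stated on the slide:

```python
def implicit_rating(views, carted, wishlisted, bought):
    """Map site behavior to a 1-5 implicit rating per the heuristic
    table. Higher-intent signals take precedence over view counts
    (an assumption about how the table is meant to be applied)."""
    if bought:
        return 5
    if carted or wishlisted:
        return 4
    if views > 5:
        return 3
    if views >= 3:
        return 2
    if views >= 1:
        return 1
    return 0  # no signal at all; not covered by the table
```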
Similarity computation
• For item-based CF, the first step is finding similar items; for user-based CF, the first step is finding similar users.
• We will use the Pearson correlation coefficient. It indicates how well a set of data points lie on a straight line. In the 2-dimensional space of two items, a user's ratings of the two items form one data point.
• There are other similarity measures, e.g., Euclidean distance and cosine distance.
Pearson Correlation Coefficient
• c(i,j) = cov(i,j) / (stddev(i) * stddev(j))
• cov(i,j) = sum((r(u,i) - av(r(i))) * (r(u,j) - av(r(j)))) / n
• stddev(i) = sqrt(sum((r(u,i) - av(r(i)))^2) / n)
• stddev(j) = sqrt(sum((r(u,j) - av(r(j)))^2) / n)
• The covariance can also be expressed in this alternative form, which we will be using: cov(i,j) = sum(r(u,i) * r(u,j)) / n - av(r(i)) * av(r(j))
where
• c(i,j) = Pearson correlation coefficient between products i and j
• cov(i,j) = covariance of ratings for products i and j
• stddev(i) = standard deviation of ratings for product i
• stddev(j) = standard deviation of ratings for product j
• r(u,i) = rating by user u for product i
• av(r(i)) = average rating for product i over all users who rated it
• sum = sum over all users
• n = number of data points
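A direct Python transcription of these formulas, using the alternative covariance form (a sketch; it assumes the two rating lists are already aligned by user and contain only co-rated items):

```python
from math import sqrt

def pearson(ratings_i, ratings_j):
    """Pearson correlation between two items' ratings, aligned by user.
    Covariance uses the form cov(i,j) = E[r_i * r_j] - E[r_i] * E[r_j]."""
    n = len(ratings_i)
    av_i = sum(ratings_i) / n
    av_j = sum(ratings_j) / n
    cov = sum(ri * rj for ri, rj in zip(ratings_i, ratings_j)) / n - av_i * av_j
    sd_i = sqrt(sum((r - av_i) ** 2 for r in ratings_i) / n)
    sd_j = sqrt(sum((r - av_j) ** 2 for r in ratings_j) / n)
    return cov / (sd_i * sd_j)
```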
Map Reduce
• We will have two MR jobs working in tandem for item-based CF. Additional preprocessing MR jobs are also needed to process click stream data.
• The first MR calculates the correlation for all item pairs based on the rating data; essentially, it finds similar items.
• The second MR takes the output of the first MR along with the rating data for the user in question. Its output is a list of items ranked by predicted rating.
Correlation Map Reduce
• It takes two kinds of input. The first kind has an item id pair and the two mean and standard deviation values for their ratings. This is generated by a preprocessor MR.
• The second kind has item ratings for all users, generated by another preprocessor MR that analyzes click stream data. Each row holds one user's ratings for a variable number of products.
Correlation Mapper Input
pid    pid    m     s     m     s
pid3   pid7   m3    s3    m7    s7
pid1   pid5   m1    s1    m5    s5
pid3   pid5   m3    s3    m5    s5

uid    pid   rate   pid   rate   pid   rate
u1     p3    r13    p5    r15    p8    r18
u4     p7    r47
u7     p6    r76    p2    r72
Correlation Mapper Output
• The mapper produces two kinds of output.
• The first kind contains {pid1,pid2,0 -> m1,s1,m2,s2}: the means and standard deviations for a pid pair.
• The second kind contains {pid1,pid2,1 -> r1 x r2}: the product of the ratings of the pid pair by some user.
• We append 0 or 1 to the mapper output key for secondary sorting, which ensures that for a given pid pair the reducer receives the single value of the first kind followed by the multiple values of the second kind.
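The effect of the appended 0/1 flag can be illustrated with a plain Python sort over composite keys (a simulation of what the secondary sort guarantees, not Hadoop code; the values are made up):

```python
# Composite keys (pid1, pid2, flag): flag 0 carries the mean/std-dev
# stats, flag 1 carries a rating product. Sorting by the full key puts
# the flag-0 record first within each pid pair, which is what the
# Hadoop secondary sort achieves in the shuffle.
records = [
    (("pid1", "pid2", 1), 15),
    (("pid1", "pid2", 0), (1.5, 0.06, 0.6, 0.09)),
    (("pid1", "pid2", 1), 12),
]
records.sort(key=lambda kv: kv[0])
# After sorting, the stats record leads, followed by the rating products.
```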
Correlation Mapper Output
key            value
pid1,pid2,0    1.5, .06, -.6, .09
pid1,pid2,1    15
pid5,pid8,1    9
pid5,pid8,1    12
pid3,pid7,1    6
pid3,pid5,0    0.8, .10, .5, .03
Correlation Reducer
• The partitioner is based on the first two tokens of the key (pid1,pid2), so that the values for the same pid pair go to the same reducer
• The grouping comparator is also on the first two tokens of the key (pid1,pid2), so that all the mapper output for the same pid pair is treated as one group and passed to the reducer in one call
• The reducer output is a pid pair and the corresponding correlation coefficient: {pid1,pid2 -> c12}
• For a pid pair, the reducer thus has at its disposal all the data needed for the Pearson correlation computation
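Ignoring the Hadoop plumbing, the reducer's arithmetic for one pid pair might look like this in Python (a sketch; it assumes the secondary sort has already placed the stats record first in the value list):

```python
def correlation_reduce(values):
    """values: for one pid pair, the stats record (m1, s1, m2, s2)
    first (guaranteed by the secondary sort), then the rating products
    r1*r2, one per co-rating user. Returns the Pearson correlation,
    using cov = sum(products)/n - m1*m2."""
    m1, s1, m2, s2 = values[0]
    products = values[1:]
    n = len(products)
    cov = sum(products) / n - m1 * m2
    return cov / (s1 * s2)
```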
Correlation Reducer Output
pid     pid     correlation
pid2    pid5    c25
pid3    pid1    c31
pid4    pid5    c45
Prediction Map Reduce
• This is the second MR. It takes the item correlation data, i.e., the output of the first MR, and the rating data for the target user.
• We run this MR to predict ratings and ultimately make recommendations for a user. The user rating data is passed to Hadoop as so-called "side data".
• The mapper output has the pid of an item as the key, and as the value the rating of the related item multiplied by the correlation coefficient, along with the correlation coefficient itself: {pid1 -> rating(pid3) x c13, c13}
Prediction Mapper Input
pid     pid     corr
pid2    pid4    c24
pid7    pid1    c71
pid3    pid4    c34

pid     rating
pid3    r3
pid5    r5
pid2    r2
Prediction Mapper Output
pid     weighted rating       corr
pid1    rating(pid3) x c13    c13
pid4    rating(pid7) x c47    c47
----    ----                  ----
Prediction Reducer
• The reducer gets a pid as key and a list of tuples as value. Each tuple holds the weighted rating of a related item and the corresponding correlation coefficient: {pid1 -> [(rating(pid3) x c31, c31), (rating(pid5) x c51, c51), ...]}
• The reducer sums up the weighted ratings and divides the sum by the sum of the correlation values. This is the final predicted rating for the item.
• The reducer output is an item pid and the predicted rating for the item. All that remains is to sort the predicted ratings and use the top n items for making recommendations.
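The reducer logic can be sketched in plain Python (hypothetical data; the per-pid grouping that the Hadoop framework performs is assumed as an input dict):

```python
def prediction_reduce(grouped):
    """grouped: {pid -> [(weighted_rating, corr), ...]} as produced by
    the prediction mapper. Returns pids ranked by predicted rating,
    highest first; the top n of this list is the recommendation."""
    predicted = {}
    for pid, tuples in grouped.items():
        num = sum(w for w, _ in tuples)   # sum of weighted ratings
        den = sum(c for _, c in tuples)   # sum of correlation values
        predicted[pid] = num / den
    return sorted(predicted, key=predicted.get, reverse=True)
```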
Realtime Prediction
• We would like to make a recommendation when a significant event occurs, e.g., an item is put in the shopping cart.
• But Hadoop is an offline batch processing system. How do we circumvent that? We pre-compute and cache the results.
• There are two MR jobs: the correlation MR to calculate item correlations, and the prediction MR to predict ratings.
• We re-run the two MR jobs as necessary when a significant change in user item ratings is detected.
Pre Computation
• As mentioned earlier, item correlation is relatively stable and only needs to be re-computed when there is a significant change in the utility matrix.
• The correlation MR for item similarity should be run only after a significant overall change in the utility matrix has been detected since the last run.
• For a given user, which is basically a row in the utility matrix, if a significant change is detected, e.g., a new rating by the user for a product becomes available, we should re-run the rating prediction MR for that user.
Cold Start Problem
• How do we make recommendations when a new item is introduced into the inventory, or a new user visits the site?
• For a new item, although we have no user interest data available, we can use content-based recommendation: essentially, similarity computation based on the attributes of the item alone.
• For a new user (a cold user?), the problem is much harder, unless detailed user profile data is available.
Some Temporal Issues
• When does an item have enough rating data to be accurately recommendable? How do we define the threshold?
• When does a user have enough ratings to get good recommendations? How do we define the threshold?
• How do we deal with old ratings, as users' interests shift over time?
• When is there enough data in the utility matrix to bootstrap the recommendation system?
Resources
• My 2-part blog post on this topic at http://pkghosh.wordpress.com
• "Programming Collective Intelligence" by Toby Segaran, O'Reilly
• "Mining of Massive Datasets" by Anand Rajaraman and Jeffrey Ullman