RubyManor talk on using Recommendation systems in production.
Recommendations in Production
Alex MacCaw
Netflix Prize
Amazon.com
Facebook
Last.fm
StumbleUpon
Google Suggest
iTunes
Rotten Tomatoes
Yelp
Google Search
Chicken or Egg
• Google Reader
• IMDB
Acts As Recommendable
Types of recommendations
• Content Based
• User Based
• Item Based
Programming Collective Intelligence
Has Many Through Relationship
• User has many UserBooks
• Book has many UserBooks
• User has many Books through UserBooks
• UserBooks can have a score (rating)
User
class User < ActiveRecord::Base
  has_many :user_books
  has_many :books, :through => :user_books
  acts_as_recommendable :books, :through => :user_books
end
Gives you
User#similar_users
User#recommended_books
Book#similar_books
The algorithms
• Manhattan Distance
• Euclidean distance
• Cosine
• Pearson correlation coefficient
• Jaccard
• Levenshtein
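To make the list above concrete, here is a minimal plain-Ruby sketch of four of the measures (Manhattan, Euclidean, cosine and Jaccard). The function names are illustrative assumptions and are not taken from acts_as_recommendable; the first three expect two equal-length arrays of numeric ratings, and Jaccard expects two arrays of ids.

```ruby
# Sum of absolute differences along each dimension.
def manhattan_distance(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# Straight-line distance between two points.
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Cosine of the angle between two rating vectors (0.0 if either is all zeros).
def cosine_similarity(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Overlap between two sets of ids: |intersection| / |union|.
def jaccard_index(a, b)
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end
```

The distance functions return smaller values for more similar inputs, while cosine and Jaccard return larger values, so a real system has to pick one convention before ranking.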
How does it work?
Strategy
• Map data into Euclidean Space
• Calculate similarity
• Use similarities to recommend
        The Black Knight   John Tucker Must Die
James   4                  5
Jonah   3                  2
George  5                  3
Alex    4                  2
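As a worked example of the first two strategy steps, the sketch below treats each user's two ratings from the table above as a point and ranks the other users by closeness to Alex. It assumes the first number in each row is the rating for The Black Knight and the second is for John Tucker Must Die. Converting a distance d to a similarity with 1 / (1 + d) is a common convention assumed here, not something the slides state, and `nearest_users` is a hypothetical helper name.

```ruby
# Ratings as [The Black Knight, John Tucker Must Die] (column order assumed).
RATINGS = {
  "James"  => [4, 5],
  "Jonah"  => [3, 2],
  "George" => [5, 3],
  "Alex"   => [4, 2]
}.freeze

# Similarity in (0, 1]: identical points score 1.0, distant points approach 0.
def euclidean_similarity(a, b)
  distance = Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  1.0 / (1.0 + distance)
end

# Every other user paired with a similarity score, most similar first.
def nearest_users(name, ratings = RATINGS)
  target = ratings.fetch(name)
  ratings
    .reject { |other, _| other == name }
    .map { |other, point| [other, euclidean_similarity(target, point)] }
    .sort_by { |_, score| -score }
end

nearest_users("Alex").each do |other, score|
  puts format("%-6s %.3f", other, score)
end
```

With these numbers Jonah comes out closest to Alex (distance 1), followed by George and then James (distance 3).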
[Figure: plot with axes The Black Knight and John Tucker Must Die, each running from 0 to 5.00 (shown on two consecutive slides)]
item id => { user id => score }
{ 1 => { 1 => 1.0, 2 => 0.0, ... }, ...}
[[1, 0.5554], [2, 0.888], [3, 0.8843], ...]
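The third strategy step, using similarities to recommend, can be sketched as a similarity-weighted average over lists shaped like the `[[id, score], ...]` array above. This is an illustrative item-based approach under assumed data; `SIMILAR_ITEMS` and `recommend` are hypothetical names, and the gem's actual scoring may differ.

```ruby
# Hypothetical precomputed lists: item id => [[other item id, similarity], ...]
SIMILAR_ITEMS = {
  1 => [[2, 0.9], [4, 0.8], [3, 0.2]],
  2 => [[1, 0.9], [3, 0.4]]
}.freeze

# user_ratings: item id => rating. Returns [[item id, predicted rating], ...]
# for items the user has not rated yet, best first.
def recommend(user_ratings, similar_items)
  totals  = Hash.new(0.0) # sum of similarity * rating per candidate item
  weights = Hash.new(0.0) # sum of similarities per candidate item

  user_ratings.each do |item_id, rating|
    similar_items.fetch(item_id, []).each do |other_id, similarity|
      next if user_ratings.key?(other_id) # skip items already rated
      totals[other_id]  += similarity * rating
      weights[other_id] += similarity
    end
  end

  totals
    .select { |id, _| weights[id] > 0 }
    .map { |id, total| [id, total / weights[id]] }
    .sort_by { |_, score| -score }
end
```

For a user who rated item 1 as 5 and item 2 as 1, item 4 ranks above item 3 because it is similar only to the item the user liked.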
Problem 1
It was far too slow to calculate on the fly (obviously)
SELECT * FROM "users" WHERE ("users"."id" = 2)
SELECT * FROM "books"
SELECT * FROM "users"
SELECT "user_books".* FROM "user_books" WHERE ("user_books".user_id IN (1,2,3,4,5,6,7,8,9,10))
SELECT * FROM "books" WHERE ("books"."id" IN (11,6,12,7,13,8,14,9,15,1,2,19,20,3,10,4,5))
SELECT * FROM "books" WHERE ("books"."id" IN (20,3,19,6))
All books All user_books
Solution
Cache the dataset
rake recommendations:build
Build offline
SELECT * FROM "user_books" WHERE ("user_books".user_id = 2)
SELECT * FROM "books" WHERE ("books"."id" = 5)
SELECT * FROM "books" WHERE ("books"."id" = 4)
SELECT * FROM "books" WHERE ("books"."id" = 8)
SELECT * FROM "books" WHERE ("books"."id" = 7)
SELECT * FROM "books" WHERE ("books"."id" = 2)
SELECT * FROM "books" WHERE ("books"."id" = 1)
Problem 2
Fetching the dataset took too long since it was so massive
Solution
Split up the cache by item
Rails.cache.write("aar_books_1", scores)
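A rough sketch of the per-item split: each item's similarity list is written under its own key, so a request only loads the entries it needs. `MemoryCache` is a hypothetical stand-in that mimics `write` and `read` so the example runs without Rails; the key pattern follows the `aar_books_1` key above, and the helper names are assumptions.

```ruby
# Hypothetical in-memory stand-in for Rails.cache (write/read only).
class MemoryCache
  def initialize
    @store = {}
  end

  def write(key, value)
    @store[key] = value
  end

  def read(key)
    @store[key]
  end
end

# Builds keys such as "aar_books_1".
def cache_key(table_name, item_id)
  "aar_#{table_name}_#{item_id}"
end

# similarities: item id => [[other item id, similarity], ...]
# One cache entry per item instead of one huge entry for the whole dataset.
def write_similarities(cache, table_name, similarities)
  similarities.each do |item_id, scores|
    cache.write(cache_key(table_name, item_id), scores)
  end
end

# Fetch a single item's list; an uncached item yields an empty list.
def read_similarities(cache, table_name, item_id)
  cache.read(cache_key(table_name, item_id)) || []
end
```

The trade-off is more cache round trips when many items are needed at once, in exchange for never deserialising the full dataset on a single request.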
Problem 3
The dataset was so big it crashed Ruby!
Solution
Get rid of ActiveRecord
Only deal with integers
items = options[:on_class].connection.select_values("SELECT id from #{options[:on_class].table_name}").collect(&:to_i)
Problem 4
It still crashed Ruby!
{ 1 => { 1 => 1.0, 2 => 0.0, ... }, ...}
Solution
Remove unnecessary cruft from dataset
{ 1 => { 1 => 1.0, ... }, ...}
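Comparing the two hashes, the entries that disappear are the `0.0` scores. A minimal sketch of that pruning, assuming a missing entry can be treated as a score of zero at lookup time (`compact_dataset` and `score_for` are hypothetical names):

```ruby
# Drop zero scores so only real ratings are kept in memory.
# dataset: item id => { user id => score }
def compact_dataset(dataset)
  dataset.transform_values do |scores|
    scores.reject { |_, score| score.zero? }
  end
end

# A missing entry is read back as 0.0, so no information is lost.
def score_for(dataset, item_id, user_id)
  dataset.dig(item_id, user_id) || 0.0
end
```

When most users have rated only a few items, the pruned hash is far smaller than the dense one.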
Problem 5
It was too slow
Solution
Re-write the slow bits in C
Details
• RubyInline
• Implemented Pearson
• Monkey patched original Ruby methods
• Very fast
Ruby Object
InlineC = Module.new do
  inline do |builder|
    builder.c '
      #include <math.h>
      #include "ruby.h"
      double c_sim_pearson(VALUE items) {
No Floats :(
InlineC = Module.new do
  inline do |builder|
    builder.c '
      #include <math.h>
      #include "ruby.h"
      double c_sim_pearson(VALUE items) {
Hash Lookup
if (!st_lookup(RHASH(prefs1)->tbl, items_a[i], &prefs1_item_ob)) {
  prefs1_item = 0.0;
} else {
  prefs1_item = NUM2DBL(prefs1_item_ob);
}
Conversion
return num / den;
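For comparison, here is a plain-Ruby sketch of a Pearson correlation that follows the shape of the C fragments above: a missing hash entry counts as 0.0, and the result is a numerator divided by a denominator. It is a reference version under assumptions, not the gem's code; the signature and the zero-variance guard are additions of this sketch.

```ruby
# prefs1, prefs2: id => score hashes; ids: the ids to compare over.
# Returns a value from -1.0 to 1.0 (0.0 when either side has no variance).
def sim_pearson(prefs1, prefs2, ids)
  n = ids.size
  return 0.0 if n.zero?

  # Missing entries count as 0.0, like the failed st_lookup branch.
  xs = ids.map { |id| prefs1.fetch(id, 0.0).to_f }
  ys = ids.map { |id| prefs2.fetch(id, 0.0).to_f }

  sum_x  = xs.sum
  sum_y  = ys.sum
  sum_xy = xs.zip(ys).sum { |x, y| x * y }

  num   = sum_xy - (sum_x * sum_y / n)
  var_x = xs.sum { |x| x * x } - sum_x**2 / n
  var_y = ys.sum { |y| y * y } - sum_y**2 / n
  return 0.0 if var_x <= 0 || var_y <= 0 # guard against dividing by zero

  num / Math.sqrt(var_x * var_y)
end
```

A version like this is useful as a test oracle: the C implementation can be checked against it on small inputs before it replaces the Ruby method.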
Design Decisions
• Not too many relationships
• Not too many ‘items’
• Similarity matrix for items, not users
Changing data
Scaling Even Further
• K Means clustering
• Split cluster by category
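The clustering idea can be sketched with a small k-means loop: group the points first, then compute similarities only within each group. This is a generic textbook version with caller-supplied starting centroids, offered as an illustration; the slides do not show how clustering would be wired into the plugin, and all names here are hypothetical.

```ruby
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Component-wise average of a non-empty list of points.
def mean_point(points)
  points.transpose.map { |values| values.sum.to_f / values.size }
end

# points and centroids are arrays of equal-length numeric arrays.
# Returns [assignments, centroids]; assignments[i] is the cluster index of points[i].
def k_means(points, centroids, max_iterations: 100)
  assignments = nil
  max_iterations.times do
    # Assign every point to its nearest centroid.
    new_assignments = points.map do |point|
      centroids.each_index.min_by { |i| squared_distance(point, centroids[i]) }
    end
    break if new_assignments == assignments # converged
    assignments = new_assignments

    # Move each centroid to the mean of its members (unchanged if it has none).
    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| assignments[j] == i }.map { |j| points[j] }
      members.empty? ? centroids[i] : mean_point(members)
    end
  end
  [assignments, centroids]
end
```

With k clusters of roughly equal size, the number of pairwise comparisons drops by about a factor of k, at the cost of missing similar items that land in different clusters.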
Adding ratings
ActiveRecord::Schema.define(:version => 1) do
  create_table "books", :force => true do |t|
    t.string   "name"
    t.datetime "created_at"
    t.datetime "updated_at"
  end

  create_table "user_books", :force => true do |t|
    t.integer "user_id", :null => false
    t.integer "book_id", :null => false
    t.integer "rating",  :default => 0
  end

  create_table "users", :force => true do |t|
    t.string   "name"
    t.datetime "created_at"
    t.datetime "updated_at"
  end
end
class User < ActiveRecord::Base
  has_many :user_books
  has_many :books, :through => :user_books
  acts_as_recommendable :books, :through => :user_books, :score => :rating
end
That’s it
Improvements?
• Better API
• Perform calculations over a cluster (EC2) using Map/Nanite
class AARN < Nanite::Actor
  expose :sim_pearson

  def sim_pearson(item1, item2)
    Optimizations.c_sim_pearson(item1, item2)
  end
end
http://eribium.org/blog
twitter: maccman
email/jabber: [email protected]
Questions?
http://rubyurl.com/kUpk
http://github.com/maccman/acts_as_recommendable