Running with Elephants
Predictive Analytics with Mahout & HDInsight
Introduction
Chris Price Senior BI Consultant with Pragmatic Works
Author • Regular Speaker • Data Geek & Super Dad!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]
You are the demo…
SQL Brewhaus: http://sqlbrewhaus.azurewebsites.net
Create an Account…
Rate some beers…
Don’t worry, your info will only be sold to the HIGHEST bidder
Agenda
• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations
Making the Business Case
Objective: Increase Revenue
• Increase # of Orders (fix website navigational inefficiency)
• Increase Items per Order (Cross-Sell)
• Increase Average Item Price (Up-Sell)
Business Case Example
• Up-Sell: increase unit price
• Cross-Sell: increase unit quantity
• Together: increased revenue
Recommendation Engines
• Take observation data and use data mining/machine learning algorithms to predict outcomes
• Assumptions:
  • People with similar interests have common preferences
  • A sufficiently large number of preferences is available
Recommendation Options
• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering
Technology
Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop
HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (on-premise solution)
Generating Recommendations
1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
  • Build User/Item matrix
  • Calculate User Similarity
  • Form Neighborhoods
  • Generate Recommendations
Sources of Data
• Explicit
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (need of the moment)
• Implicit
  • Purchase History
  • Click/Browse History
• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions
Our focus for today
Data Preparation
• Clean-Up:
  • Remove Outliers (Z-Score)
  • Remove frequent buyers (Skew)
  • Normalize Data (Unity-Based)
• Format data into a CSV input file:
  <User ID>,<Item ID>,<Rating>
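The clean-up steps above can be sketched in plain Java. This is a minimal illustration (not part of Mahout); the class and method names are hypothetical, and unity-based normalization is shown as simple min-max scaling:

```java
import java.util.Locale;

public class PrepRatings {

    // Z-score of a rating, used to flag outliers (e.g. |z| > 3).
    static double zScore(double x, double mean, double stdDev) {
        return (x - mean) / stdDev;
    }

    // Unity-based (min-max) normalization: maps a rating into [0, 1].
    static double normalize(double rating, double min, double max) {
        return (rating - min) / (max - min);
    }

    // Format one CSV input row: <User ID>,<Item ID>,<Rating>
    static String toCsvRow(long userId, long itemId, double rating) {
        return String.format(Locale.US, "%d,%d,%.2f", userId, itemId, rating);
    }

    public static void main(String[] args) {
        // A 4-star rating on a 1-5 scale becomes 0.75 on the unit scale.
        double norm = normalize(4.0, 1.0, 5.0);
        System.out.println(toCsvRow(7, 42, norm)); // 7,42,0.75
    }
}
```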
How it Works?
• Build a User/Item Matrix
(Matrix: rows = Users 1…N, columns = Items 1…n; a 1 marks that the user has expressed a preference for the item)
Neighborhood Formation
(Diagram: users U1 through U7 plotted by similarity and grouped into neighborhoods, e.g. {U1, U5, U7})
Neighborhood Formation
• Requires some experimentation
• Similarity Metrics:
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
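As a taste of what these metrics compute, here is a minimal sketch of the Tanimoto coefficient on two users' item sets; this is plain Java for illustration, not Mahout's `TanimotoCoefficientSimilarity`:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Similarity {

    // Tanimoto (Jaccard) coefficient of two users' item sets:
    // |A ∩ B| / |A ∪ B|, ranging from 0 (disjoint) to 1 (identical).
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Long> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<Long> u1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> u2 = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        // 2 shared items out of 4 distinct items.
        System.out.println(tanimoto(u1, u2)); // 0.5
    }
}
```

Tanimoto ignores rating values and only looks at which items overlap, which is why it suits boolean (viewed/purchased) preference data.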
How it Works?
• Find users similar to U5
• Use a similarity metric (kNN)
• U1 & U7 are identified as most similar to U5
(Matrix: U1's and U7's rows overlap most with U5's, so they form U5's neighborhood)
How it Works?
• Generate Recommendations:
  • Find items that have not been reviewed (I1 and I6)
  • Predict the rating by taking a weighted sum
(Matrix: predicted ratings, e.g. 0.5 and 0.7, filled in for the items U5 has not reviewed, computed as weighted sums over the neighborhood)
Pseudo-Code Implementation
for each item i that u has no preference for:
  for each user v that has a preference for i (restricted to u's neighborhood):
    compute similarity s between u and v
    calculate running average of v's preference for i, weighted by s
return top-ranked (weighted average) items i
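The inner loop of that pseudo-code, predicting one item's rating as a similarity-weighted average, can be sketched in plain Java (hypothetical names, not the Mahout API):

```java
import java.util.HashMap;
import java.util.Map;

public class WeightedSum {

    // Predict u's rating for an item as sum(s_v * r_v) / sum(s_v),
    // where r_v is neighbor v's rating and s_v is v's similarity to u.
    static double predict(Map<Long, Double> neighborRatings,
                          Map<Long, Double> neighborSimilarity) {
        double num = 0, den = 0;
        for (Map.Entry<Long, Double> e : neighborRatings.entrySet()) {
            Double s = neighborSimilarity.get(e.getKey());
            if (s == null) continue; // v is outside u's neighborhood
            num += s * e.getValue();
            den += s;
        }
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        Map<Long, Double> ratings = new HashMap<>();
        ratings.put(1L, 4.0);  // U1 rated the item 4.0
        ratings.put(7L, 2.0);  // U7 rated the item 2.0
        Map<Long, Double> sims = new HashMap<>();
        sims.put(1L, 0.9);     // U1 is very similar to U5
        sims.put(7L, 0.1);     // U7 much less so
        // (0.9 * 4.0 + 0.1 * 2.0) / (0.9 + 0.1) = 3.8
        System.out.println(predict(ratings, sims)); // 3.8
    }
}
```

Because the sum is weighted, the prediction leans toward the most similar neighbors rather than treating all raters equally.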
Mahout Implementation
• Real-Time Recommendations
  • Write Java code and host in a JVM instance
  • Limited scalability
  • Requires training data
  • Integration typically handled through web services
• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline and slow, yet scalable
  • Out-of-the-box recommender jobs
Mahout MapReduce Implementation
1. Generate List of ItemIDs
2. Create Preference Vector
3. Count Unique Users
4. Transpose Preference Vectors
5. Row Similarity
  • Compute Weights
  • Compute Similarities
  • Similarity Matrix
6. Pre-Partial Multiply, Similarity Matrix
7. Pre-Partial Multiply, Preferences
8. Partial Multiply (Steps 6 & 7)
9. Filter Items
10. Aggregate & Recommend
Integrating Mahout
• Real-Time
  • Requires Java coding
  • Web service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations
• Batch
  • ETL from source
  • Generate input file (UserID, ItemID, Rating)
  • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
• Output format: UserID [ItemID:Estimated Rating, …]
  e.g. 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
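The ETL step that pulls the output back out of HDFS has to parse lines in that format. A minimal sketch (plain Java; the exact delimiter between the user ID and the bracketed list may vary by Mahout version, here assumed to be whitespace):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseOutput {

    // Parse one recommender output line of the form
    // "<UserID> [<ItemID>:<rating>,<ItemID>:<rating>,...]"
    // into a map of itemId -> estimated rating.
    static Map<Long, Double> parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        String body = parts[1].substring(1, parts[1].length() - 1); // strip [ ]
        Map<Long, Double> recs = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            recs.put(Long.parseLong(kv[0]), Double.parseDouble(kv[1]));
        }
        return recs;
    }

    public static void main(String[] args) {
        Map<Long, Double> recs = parse("7\t[1:4.5,2:4.5,3:4.5]");
        System.out.println(recs.get(1L)); // 4.5
        System.out.println(recs.size()); // 3
    }
}
```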
Handling Recommendations
Storing Recommendations:
• Hive
  • Data warehouse system for Hadoop
  • Hive ODBC Driver
• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB driver
• HBase
  • Open-source distributed, column-oriented database modeled after Google’s BigTable
  • Use Pig/MapReduce to process output files and load the HBase table
  • Java API for easy reading
• Source system (SQL Server, etc.)
Evaluating the Recommendations
• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into test & training sets:
  • Average Difference
  • Root-Mean-Square
• How it works:

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
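Both metrics are a few lines of code. A self-contained sketch using the table's numbers (plain Java, separate from Mahout's built-in evaluators):

```java
public class EvalMetrics {

    // Mean absolute difference between estimated and actual reviews.
    static double meanAbsoluteError(double[] est, double[] actual) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) {
            sum += Math.abs(est[i] - actual[i]);
        }
        return sum / est.length;
    }

    // Root-mean-square of the differences; penalizes large misses harder.
    static double rmse(double[] est, double[] actual) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) {
            double d = est[i] - actual[i];
            sum += d * d;
        }
        return Math.sqrt(sum / est.length);
    }

    public static void main(String[] args) {
        double[] est = {3.5, 4.0, 1.5}; // estimated reviews from the table
        double[] act = {4.0, 2.0, 2.0}; // actual reviews
        System.out.println(meanAbsoluteError(est, act)); // 1.0
        System.out.printf("%.2f%n", rmse(est, act));     // 1.22
    }
}
```

Note how RMSE (≈1.22) exceeds the average difference (1.0): squaring the errors weights the 2.0 miss on I2 more heavily.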
Evaluating the Recommendations
DataModel model = new FileDataModel(new File("ratings.csv"));

RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Generate neighborhoods of approx. 10 users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};

// Use 70% of the data to train the model and the rest to test
double score = eval.evaluate(bldr, null, model, 0.7, 1.0);
Challenges
1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality
Context Challenges
(Image slide: it’s January, 20 degrees & snowing… which recommendations still make sense?)
Other Challenges
• Cold Start
  • Occurs when either a new item or new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation generation technique (Content-Based)
• Data Sparsity
  • Too many items/users make finding intersections difficult
• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out
• Curse of Dimensionality
  • More items/users lead to more noise and greater error
Resources
• Mahout in Action by Sean Owen, Robin Anil, Ted Dunning & Ellen Friedman
• Hadoop: The Definitive Guide by Tom White