Running with Elephants Predictive Analytics with Mahout & HDInsight

Running with Elephants: Predictive Analytics with HDInsight


DESCRIPTION

Amazon and Twitter do it, Wal-Mart and Facebook too… What about you? Big Data predictive analytics is pervasive, and with HDInsight it has never been more approachable. In this session you become part of the demo as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. This action-packed session covers real-world, practical solutions for moving data into and out of HDFS (with Sqoop), using MongoDB or HBase as a source/destination and, of course, handling Mahout processing in distributed mode.


Page 1: Running with Elephants: Predictive Analytics with HDInsight

Running with Elephants

Predictive Analytics with Mahout & HDInsight


You are the demo….

SQL Brewhaus: http://sqlbrewhaus.azurewebsites.net

Create an Account…

Rate some beers…

Don’t worry, your info will only be sold to the HIGHEST bidder


Agenda

• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations


Making the Business Case

Objective: Increase Revenue
• Increase # of Orders
• Increase Items per Order
• Increase Average Item Price

Levers: Up-Sell, Cross-Sell, Website Navigational Inefficiency


Business Case Example

• Up-Sell: Increase Unit Price
• Cross-Sell: Increase Unit Qty
• Result: Increased Revenue


Recommendation Engines

• Take observation data and use data mining/machine learning algorithms to predict outcomes

• Assumptions:
  • People with similar interests have common preferences
  • A sufficiently large number of preferences is available


Recommendation Options

• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering


Technology

Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop

HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (seamlessly scale in the cloud)
• Hortonworks Data Platform/HDP (on-premise solution)


Generating Recommendations

1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
  • Build User/Item matrix
  • Calculate User Similarity
  • Form Neighborhoods
  • Generate Recommendations


Sources of Data

• Explicit
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (need of the moment)

• Implicit
  • Purchase History
  • Click/Browse History

• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions

Our focus for today


Data Preparation

• Clean-Up:
  • Remove Outliers (Z-Score)
  • Remove frequent buyers (Skew)
  • Normalize Data (Unity-Based)

• Format Data into CSV input file: <User ID>, <Item ID>, <Rating>
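The clean-up steps can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the deck's actual pipeline: the class `RatingPrep`, its method names, and the z-score cutoff passed as `maxZ` are all hypothetical, and unity-based normalization is taken here to mean min-max rescaling into [0, 1].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper illustrating the clean-up steps; not part of Mahout. */
final class RatingPrep {

    /** One (user, item, rating) observation. */
    static final class Rating {
        final long userId;
        final long itemId;
        final double value;

        Rating(long userId, long itemId, double value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Drops ratings whose z-score magnitude exceeds maxZ (an assumed cutoff). */
    static List<Rating> removeOutliers(List<Rating> ratings, double maxZ) {
        double mean = ratings.stream().mapToDouble(r -> r.value).average().orElse(0.0);
        double variance = ratings.stream()
                .mapToDouble(r -> (r.value - mean) * (r.value - mean))
                .average().orElse(0.0);
        double std = Math.sqrt(variance);
        if (std == 0.0) {
            return new ArrayList<>(ratings); // all values identical, nothing to drop
        }
        List<Rating> kept = new ArrayList<>();
        for (Rating r : ratings) {
            if (Math.abs((r.value - mean) / std) <= maxZ) {
                kept.add(r);
            }
        }
        return kept;
    }

    /** Unity-based (min-max) normalization: rescales values into [0, 1]. */
    static List<Rating> normalize(List<Rating> ratings) {
        double min = ratings.stream().mapToDouble(r -> r.value).min().orElse(0.0);
        double max = ratings.stream().mapToDouble(r -> r.value).max().orElse(0.0);
        double range = max - min;
        List<Rating> out = new ArrayList<>();
        for (Rating r : ratings) {
            double scaled = range == 0.0 ? 0.0 : (r.value - min) / range;
            out.add(new Rating(r.userId, r.itemId, scaled));
        }
        return out;
    }

    /** Formats one observation as a "<User ID>,<Item ID>,<Rating>" CSV line. */
    static String toCsvLine(Rating r) {
        return String.format(Locale.ROOT, "%d,%d,%.2f", r.userId, r.itemId, r.value);
    }
}
```

Removing frequent buyers (skew) is not shown; it would amount to a per-user count followed by a filter with a threshold chosen for the data set.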


How it Works?

• Build a User/Item Matrix

[Figure: user/item matrix. Rows are items 1 … N, columns are users 1 … n; a 1 marks a recorded preference.]
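A matrix like this is mostly empty in practice, so one reasonable way to hold it in memory is a sparse map of maps built from the `<User ID>, <Item ID>, <Rating>` input lines. The sketch below assumes that layout; `PreferenceMatrix` and `fromCsv` are illustrative names, not Mahout classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sparse user/item matrix: user -> (item -> rating). */
final class PreferenceMatrix {

    /** Parses "user,item,rating" lines, skipping blank ones. */
    static Map<Long, Map<Long, Double>> fromCsv(List<String> lines) {
        Map<Long, Map<Long, Double>> prefs = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(",");
            long user = Long.parseLong(parts[0].trim());
            long item = Long.parseLong(parts[1].trim());
            double rating = Double.parseDouble(parts[2].trim());
            // Only cells that actually hold a preference are stored.
            prefs.computeIfAbsent(user, k -> new HashMap<>()).put(item, rating);
        }
        return prefs;
    }
}
```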


Neighborhood Formation

[Figure: users U1 through U7 arranged into neighborhoods.]


Neighborhood Formation

• Requires some experimentation
• Similarity Metrics
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
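To make two of these metrics concrete, here is a small stdlib-only sketch of Pearson correlation (which uses rating values) and the Tanimoto coefficient (which only looks at which items were rated). The class `SimilaritySketch` is hypothetical and is not how Mahout implements these; it assumes each user's preferences are given as an item-to-rating map.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical similarity helpers; Mahout ships its own implementations. */
final class SimilaritySketch {

    /** Pearson correlation over items both users rated; NaN when undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        Set<Long> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        int n = common.size();
        if (n < 2) {
            return Double.NaN; // not enough overlap to correlate
        }
        double sumA = 0.0, sumB = 0.0;
        for (Long item : common) {
            sumA += a.get(item);
            sumB += b.get(item);
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (Long item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0.0 || varB == 0.0) {
            return Double.NaN; // one user rated everything identically
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** Tanimoto coefficient: |intersection| / |union| of the rated item sets. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }
}
```

Tanimoto suits data without rating values (for example click or purchase history), while Pearson needs graded ratings to be meaningful.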


How it Works?

• Find users similar to U5

• Use a similarity metric (kNN)

• U1 & U7 are identified as most similar to U5

[Figure: user/item matrix. Rows are items 1 … N, columns are users 1 … n; a 1 marks a recorded preference.]


How it Works?

• Generate Recommendations:
  • Find items that have not been reviewed (I1 and I6)
  • Predict rating by taking weighted sum

[Figure: user/item matrix (items as rows, users as columns) with predicted ratings filled in: 0.5 for item 1 and 0.7 for item 6.]


Pseudo-Code Implementation

for each item i that u has no preference for
    for each user v that has a preference for i    (restrict v to u's neighborhood)
        compute similarity s between u and v
        calculate running average of v's preference for i, weighted by s

return top-ranked items i (by weighted average)
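The pseudo-code can be turned into a small runnable sketch. Everything here is an illustrative assumption rather than Mahout's implementation: the class `UserBasedSketch`, the use of cosine similarity (with unrated items treated as 0), and the neighborhood being simply the `k` most similar users. The loops run over neighbors first and items second, which yields the same weighted averages as the item-first order in the pseudo-code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical user-based recommender following the slide's pseudo-code. */
final class UserBasedSketch {

    /** Cosine similarity, treating items a user has not rated as 0. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Estimates u's rating for each unseen item from the k nearest users. */
    static Map<Long, Double> estimate(long u, Map<Long, Map<Long, Double>> prefs, int k) {
        Map<Long, Double> mine = prefs.get(u);
        Map<Long, Double> sims = new HashMap<>();
        for (Map.Entry<Long, Map<Long, Double>> other : prefs.entrySet()) {
            if (other.getKey().longValue() != u) {
                sims.put(other.getKey(), cosine(mine, other.getValue()));
            }
        }
        // Neighborhood: the k users most similar to u.
        List<Long> neighbors = new ArrayList<>(sims.keySet());
        neighbors.sort((x, y) -> Double.compare(sims.get(y), sims.get(x)));
        if (neighbors.size() > k) {
            neighbors = neighbors.subList(0, k);
        }
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();
        for (Long v : neighbors) {
            double s = sims.get(v);
            if (s <= 0.0) {
                continue; // dissimilar users contribute nothing
            }
            for (Map.Entry<Long, Double> p : prefs.get(v).entrySet()) {
                if (!mine.containsKey(p.getKey())) {
                    weightedSum.merge(p.getKey(), s * p.getValue(), Double::sum);
                    weightTotal.merge(p.getKey(), s, Double::sum);
                }
            }
        }
        Map<Long, Double> estimates = new HashMap<>();
        for (Map.Entry<Long, Double> e : weightedSum.entrySet()) {
            estimates.put(e.getKey(), e.getValue() / weightTotal.get(e.getKey()));
        }
        return estimates;
    }

    /** Returns up to n item IDs ordered by estimated rating, highest first. */
    static List<Long> topItems(Map<Long, Double> estimates, int n) {
        List<Long> items = new ArrayList<>(estimates.keySet());
        items.sort((x, y) -> Double.compare(estimates.get(y), estimates.get(x)));
        return new ArrayList<>(items.subList(0, Math.min(n, items.size())));
    }
}
```

Calling `topItems(estimate(userId, prefs, 10), 5)` would give the five best-ranked unseen items for a user from a neighborhood of ten.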


Mahout Implementation

• Real-Time Recommendations
  • Write Java code and host in a JVM instance
  • Limited scalability
  • Requires training data
  • Integration typically handled through web services

• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline, slow, yet scalable
  • Out-of-the-box recommender jobs


Mahout MapReduce Implementation

1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
  • Compute Weights
  • Compute Similarities
  • Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend


Integrating Mahout

• Real-Time
  • Requires Java coding
  • Web Service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations

• Batch
  • ETL from source
    • Generate input file (UserID, ItemID, Rating)
    • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
    • 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
    • UserID [ItemID:Estimated Rating, ………]
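For the ETL step out of HDFS, each output line in the `UserID [ItemID:Estimated Rating, …]` shape has to be parsed before it can be loaded elsewhere. The sketch below is one possible parser; `RecommendationLine` is a hypothetical class, and it assumes the user ID and the bracketed list are separated by whitespace (a tab or a space).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parsed form of one recommender output line. */
final class RecommendationLine {
    final long userId;
    final Map<Long, Double> estimates; // itemId -> estimated rating, in file order

    private RecommendationLine(long userId, Map<Long, Double> estimates) {
        this.userId = userId;
        this.estimates = estimates;
    }

    /** Parses a line such as "7 [1:4.5,2:4.5,3:4.5]". */
    static RecommendationLine parse(String line) {
        String[] halves = line.trim().split("\\s+", 2);
        long user = Long.parseLong(halves[0]);
        String body = halves[1].trim();
        body = body.substring(1, body.length() - 1); // strip "[" and "]"
        Map<Long, Double> estimates = new LinkedHashMap<>();
        if (!body.isEmpty()) {
            for (String pair : body.split(",")) {
                String[] kv = pair.split(":");
                estimates.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1].trim()));
            }
        }
        return new RecommendationLine(user, estimates);
    }
}
```

Each parsed entry maps naturally onto a (UserID, ItemID, EstimatedRating) row for whichever store receives the recommendations.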


Handling Recommendations

Storing Recommendations:
• Hive
  • Data warehouse system for Hadoop
  • Hive ODBC Driver

• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB Driver

• HBase
  • Open-source, distributed, column-oriented database modeled after Google’s BigTable
  • Use Pig/MapReduce to process output files and load HBase table
  • Java API for easy reading

• Source System (SQL Server, etc.)


Evaluating the Recommendations

• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into test & training data sets:
  • Average Difference
  • Root-Mean-Square

• How it works:

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0

Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) = √1.5 ≈ 1.22
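Both scores are simple to compute by hand, which helps when sanity-checking an evaluator's output. The following stdlib-only sketch computes them for paired arrays of estimated and actual ratings; `ErrorMetrics` and its method names are illustrative and not part of Mahout.

```java
/** Hypothetical helper computing the two evaluation scores. */
final class ErrorMetrics {

    /** Mean of |estimated - actual| over all test ratings. */
    static double averageAbsoluteDifference(double[] estimated, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            sum += Math.abs(estimated[i] - actual[i]);
        }
        return sum / estimated.length;
    }

    /** Square root of the mean squared difference; penalizes large misses more. */
    static double rootMeanSquare(double[] estimated, double[] actual) {
        double sumSquares = 0.0;
        for (int i = 0; i < estimated.length; i++) {
            double diff = estimated[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / estimated.length);
    }

    public static void main(String[] args) {
        double[] estimated = {3.5, 4.0, 1.5}; // I1, I2, I3 from the slide
        double[] actual = {4.0, 2.0, 2.0};
        System.out.println(averageAbsoluteDifference(estimated, actual));
        System.out.println(rootMeanSquare(estimated, actual));
    }
}
```

In both cases lower is better; a score of 0 would mean every estimate matched the held-out rating exactly.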


Evaluating the Recommendations

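import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Load the <User ID>,<Item ID>,<Rating> input file
DataModel model = new FileDataModel(new File("ratings.csv"));

RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder bldr = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // Use the Pearson Correlation to calculate similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Generate neighborhoods of the 10 nearest users
        UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
        return new GenericUserBasedRecommender(model, hood, similarity);
    }
};

// Use 70% of the data to train the model and 30% to test;
// null = default DataModelBuilder, 1.0 = evaluate against all users
double score = eval.evaluate(bldr, null, model, 0.7, 1.0);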


Challenges

1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality


Context Challenges

???

January

20 degrees & snowing…


Other Challenges

• Cold Start
  • Occurs when either a new item or a new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation generation technique (Content-Based)

• Data Sparsity
  • Too many items per user make finding intersections difficult

• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out

• Curse of Dimensionality
  • More items per user lead to more noise and greater error
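As an illustration of the cold-start fallback of substituting an average profile, the sketch below computes the average rating per item across all known users. A brand-new user with no history could be given this profile until real preferences arrive. The class `ColdStartFallback` and its method are hypothetical and assume preferences are held as a user-to-(item-to-rating) map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cold-start helper: builds an "average user" profile. */
final class ColdStartFallback {

    /** Average rating per item across every user who rated it. */
    static Map<Long, Double> averageItemProfile(Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Double> sums = new HashMap<>();
        Map<Long, Integer> counts = new HashMap<>();
        for (Map<Long, Double> userPrefs : prefs.values()) {
            for (Map.Entry<Long, Double> p : userPrefs.entrySet()) {
                sums.merge(p.getKey(), p.getValue(), Double::sum);
                counts.merge(p.getKey(), 1, Integer::sum);
            }
        }
        Map<Long, Double> averages = new HashMap<>();
        for (Map.Entry<Long, Double> e : sums.entrySet()) {
            averages.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
        }
        return averages;
    }
}
```

Note that this fallback leans directly into popularity bias, so it is best treated as a temporary substitute rather than a long-term profile.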


Resources

Mahout in Action. Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman

Hadoop: The Definitive Guide. Tom White