© 2013 Datameer, Inc. All rights reserved.

Top 3 Considerations for Machine Learning on Big Data

DESCRIPTION

View the full recording of this deck here: http://info.datameer.com/Slideshare-Top-3-Things-to-Consider-for-Machine-Learning-on-Big-Data.html

Machine learning is powerful but requires coding and access to all the relevant datasets to get full insights. With new Big Data analytic tools, business users can now use machine learning to gain a competitive edge. Based on best practices and customer experiences, join Datameer and Caserta Concepts as we discuss what to look for and what value organizations get out of Machine Learning on Big Data.

This webinar will provide:
• an overview of challenges and tools available today
• use cases for machine learning on Hadoop
• capabilities to look for
• comparison of available solutions

Citation preview

Page 1: Top 3 Considerations for Machine Learning on Big Data

Page 2: Top 3 Considerations for Machine Learning on Big Data

Top 3 Things to Consider with Machine Learning on Big Data

Karen Hsu
Elliott Cordo

Page 3: Top 3 Considerations for Machine Learning on Big Data

About our Speakers

Karen Hsu
• Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, Karen Hsu has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
• Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
• Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.

Page 4: Top 3 Considerations for Machine Learning on Big Data

About our Speakers

Elliott Cordo
• Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
• At Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.

Page 5: Top 3 Considerations for Machine Learning on Big Data

• Drivers & Challenges
• Use Cases
• Key Criteria
• Best Practices
• Next Steps

Page 6: Top 3 Considerations for Machine Learning on Big Data

Drivers & Challenges

Page 7: Top 3 Considerations for Machine Learning on Big Data

Big Data Analytics Drives Results

[Chart: Amazon vs Barnes & Noble. Vertical axis $0 to $300; horizontal axis quarterly dates from 12/31/09 to 03/21/13]

[Chart: Netflix vs Blockbuster. Vertical axis $0 to $300; horizontal axis quarterly dates from 12/31/09 to 03/21/13]

Page 8: Top 3 Considerations for Machine Learning on Big Data

Alternatives Are Lacking

Data Mining
• Hard to use
• Requires PhD experts
• Must write code
• Expensive

Traditional BI
• Fixed DW models
• Must write code for analytics
• Very high IT labor costs
• Not agile

Visualization
• Easy for small teams
• Can’t manage large data volume
• Lacks support for advanced analytics

Page 9: Top 3 Considerations for Machine Learning on Big Data

Costs of Building Can be $1M+

$1M+ in Salaries

Job Title | Bay Area | New York
IT Project Manager | $140,000.00 | $126,000.00
System Administrator | $117,000.00 | $105,000.00
Network Administrator | $119,000.00 | $107,000.00
Database Administrator | $125,000.00 | $119,000.00
IT Security Manager | $116,000.00 | $104,000.00
Business Intelligence Analyst | $137,000.00 | $133,000.00
Data Scientist | $138,000.00 | $133,000.00
Java Developer | $136,000.00 | $133,000.00
QA Engineer | $120,000.00 | $114,000.00
Total | $1,148,000.00 | $1,074,000.00

$1M+ in Capital

Solution | Cost / 100TB
Teradata EDW | $1,650,000.00
Oracle Exadata | $1,400,000.00
IBM Netezza | $1,000,000.00
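As a quick check on the "$1M+ in Salaries" headline, the sketch below adds up the salary figures listed on this slide for each region. The variable names are illustrative only; the numbers are the ones shown in the salary table.

```python
# Salary figures copied from the slide, in the order listed
# (IT Project Manager ... QA Engineer). Names are illustrative.
bay_area_salaries = [140_000, 117_000, 119_000, 125_000, 116_000,
                     137_000, 138_000, 136_000, 120_000]
new_york_salaries = [126_000, 105_000, 107_000, 119_000, 104_000,
                     133_000, 133_000, 133_000, 114_000]

bay_area_total = sum(bay_area_salaries)
new_york_total = sum(new_york_salaries)

print(f"Bay Area total: ${bay_area_total:,}")
print(f"New York total: ${new_york_total:,}")
```

Both totals match the last row of the salary table, which is where the "$1M+" figure comes from.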

Page 10: Top 3 Considerations for Machine Learning on Big Data

Use Cases

Page 11: Top 3 Considerations for Machine Learning on Big Data

Use Cases

Use Case | What is Revealed
Profiling and segmentation | Customer, product, market characteristics and segments
Acquisition and retention | What leads a person to become a customer or stop being a customer
Product development and operations optimization | What led to product or network failure
Campaign management | Patterns of successful campaigns
Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile

Page 12: Top 3 Considerations for Machine Learning on Big Data

Customer Examples

Industry | Use Case

Financial Services
• Show correlation between services purchased and investments/trades made
• Identify customer segments
• Recommendations for research articles to drive trading

eCommerce
• Show types of events a person will like
• Decision tree based on likelihood to click through
• Recommendations for a large “cold start” population

Gaming
• Clustering for user profiles
• Correlation between attributes of a game and behavior
• Churn analysis

Healthcare
• Recommend tests or other offerings
• Identify factors/trends that lead to disease

Page 13: Top 3 Considerations for Machine Learning on Big Data

Polling Question I

Page 14: Top 3 Considerations for Machine Learning on Big Data

Key Criteria

Page 15: Top 3 Considerations for Machine Learning on Big Data

• Ease of Use
• Quality

Page 16: Top 3 Considerations for Machine Learning on Big Data

Clustering

Page 17: Top 3 Considerations for Machine Learning on Big Data

Clustering Overview

K-Means
• K-means is a popular and versatile general purpose clustering algorithm.
• Commonly used to group people and objects together to form segments
• Often leveraged to enhance recommendation and search systems

How it works
1. Treats items as coordinates
2. Places a number of random “centroids” and assigns the nearest items
3. Moves the centroids around based on average location
4. Process repeats until the assignments stop changing

*Diagram from Programming Collective Intelligence by Toby Segaran
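The four "How it works" steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code behind any tool in this deck: the function names, the squared Euclidean distance, and the choice to seed the centroids from randomly picked items are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Average location of a non-empty list of coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iterations=100, seed=0):
    """Return (assignments, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 2: place k centroids, here by picking k items at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2 (continued): assign every item to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its items.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # a centroid with no items stays where it is
                centroids[c] = mean_point(members)
    return assignments, centroids


# Step 1: items are treated as coordinates (toy data, two obvious groups).
items = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(items, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first two items share one label and the last two share the other.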

Page 18: Top 3 Considerations for Machine Learning on Big Data

Ease of Use

In Datameer, you select the columns... And get the results

First, the set up...
And then run the results...
And write additional code to scale...

And the quality of results increases with larger data sets…

Page 19: Top 3 Considerations for Machine Learning on Big Data

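Ease of Use

In Datameer, you select the columns... And get the results

First, you have to set up...

    # principal components of the four numeric iris columns
    pca <- princomp(iris[1:4])
    # k-means with 3 clusters; keep only the cluster assignment of each row
    colors <- kmeans(iris[1:4], 3)$cluster
    # plot the first two components, colored by cluster
    plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5)

And then run the results...
And then write more code to scale...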

Page 20: Top 3 Considerations for Machine Learning on Big Data

Ease of Use

In Datameer, you select the columns... And get the results

First, select the data...
Second, you need to create the cluster...
And then see the results

Page 21: Top 3 Considerations for Machine Learning on Big Data

Ease of Use

In Datameer, you select the columns... And get the results

1. First, a dataset’s attributes must be converted to numeric representations

User | Location | Company | Favorite Algo
Elliott | New Jersey | Caserta | K-Means
Karen | California | Datameer | K-Means

User | Location | Company | Favorite Algo
1001 | 1 | 101 | 1001
1002 | 2 | 102 | 1001

2. This numeric dataset is then converted to a sequence file, then to sparse vectors using seqdirectory and seq2sparse

3. Mahout is called, and the number of clusters and the distance calculation are specified

    bin/mahout kmeans \
      -i /user/kmeans/vectors \
      -c /user/kmeans/input \
      -o /user/kmeans/output \
      -k 200 \
      -dm CosineSimilarity \
      -x 20 \
      -ow

4. The sparse vector output is then converted back to a delimited format

5. Textual attributes will be appended back to the record, with numeric values preserved for ad-hoc distance comparison of members within a cluster

*Diagram from Programming Collective Intelligence by Toby Segaran
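Steps 1 and 5 (turning textual attributes into numbers, then attaching the text back afterwards) can be sketched as follows. This is an illustrative sketch, not the pipeline used in the deck: the helper names are hypothetical, and the starting offsets (1001 for User, 1 for Location, 101 for Company, 1001 for Favorite Algo) are inferred from the example values on the slide.

```python
def build_codebook(values, start=1):
    """Map each distinct value to an integer code, in order of first appearance."""
    codebook = {}
    for value in values:
        if value not in codebook:
            codebook[value] = start + len(codebook)
    return codebook


def encode_rows(rows, columns, starts):
    """Step 1: replace every textual attribute with its numeric code."""
    codebooks = {
        col: build_codebook((row[col] for row in rows), starts.get(col, 1))
        for col in columns
    }
    encoded = [[codebooks[col][row[col]] for col in columns] for row in rows]
    return encoded, codebooks


def decode_row(encoded_row, columns, codebooks):
    """Step 5: look the textual attributes back up from the numeric codes."""
    decoded = {}
    for col, code in zip(columns, encoded_row):
        reverse = {c: v for v, c in codebooks[col].items()}
        decoded[col] = reverse[code]
    return decoded


columns = ["User", "Location", "Company", "Favorite Algo"]
rows = [
    {"User": "Elliott", "Location": "New Jersey", "Company": "Caserta", "Favorite Algo": "K-Means"},
    {"User": "Karen", "Location": "California", "Company": "Datameer", "Favorite Algo": "K-Means"},
]
# Offsets assumed from the slide's example values.
starts = {"User": 1001, "Location": 1, "Company": 101, "Favorite Algo": 1001}

encoded, codebooks = encode_rows(rows, columns, starts)
print(encoded)
print(decode_row(encoded[1], columns, codebooks))
```

Keeping the codebooks around is what makes step 5 possible: the clustering output only contains numbers, so the mapping has to be stored somewhere to restore the original text.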

Page 22: Top 3 Considerations for Machine Learning on Big Data

Quality Comparison

Page 23: Top 3 Considerations for Machine Learning on Big Data

Column Dependencies

Page 24: Top 3 Considerations for Machine Learning on Big Data

Column Dependencies Overview

A | B
a | x
b | y
b | y
a | x
c | z
a | y
Column Dependency ~ 0.99

C | D
a | x
b | x
b | y
a | z
c | y
a | y
Column Dependency ~ 0.01

Value
• See how data is related after joining multiple sets of data
• See column dependencies on multiple types of data
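The deck does not say how the column dependency score is computed. As a rough illustration of the idea only, the sketch below uses a standard information-theoretic measure, the uncertainty coefficient (the share of one column's entropy that is explained by knowing the other column). It is a stand-in chosen for this example, not Datameer's formula, and it should not be expected to reproduce the 0.99 and 0.01 figures on the slide.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def conditional_entropy(target, given):
    """Entropy of `target` that remains once `given` is known."""
    total = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / total * entropy(ts) for ts in groups.values())


def uncertainty_coefficient(target, given):
    """0.0 means `given` says nothing about `target`; 1.0 means it determines it."""
    h_target = entropy(target)
    if h_target == 0:
        return 1.0  # a constant column is trivially determined
    return (h_target - conditional_entropy(target, given)) / h_target


# Column values as they appear on the slide.
col_a = ["a", "b", "b", "a", "c", "a"]
col_b = ["x", "y", "y", "x", "z", "y"]
col_c = ["a", "b", "b", "a", "c", "a"]
col_d = ["x", "x", "y", "z", "y", "y"]

print(uncertainty_coefficient(col_b, col_a))
print(uncertainty_coefficient(col_d, col_c))
```

A measure of this kind works on strings as well as numbers, which fits the slide's point about seeing dependencies across multiple types of data.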

Page 25: Top 3 Considerations for Machine Learning on Big Data

Quality Comparison

[Figure: six plot panels of Column A against Column B, each titled with its computed dependency]
• Column A vs Column B: ColumnDependency(A,B) = 0
• Column A vs Column B: ColumnDependency(A,B) = 0.5
• Column A vs Column B: ColumnDependency(A,B) = 0.5
• Column A vs Column B: ColumnDependency(A,B) = 1
• Column A (NUMBER) vs Column B (STRING): ColumnDependency(A,B) = 0.5
• Column A (NUMBER) vs Column B (STRING): ColumnDependency(A,B) = 1

Page 26: Top 3 Considerations for Machine Learning on Big Data

Decision Tree

Page 27: Top 3 Considerations for Machine Learning on Big Data

Goal: Create a model that predicts the value of a target based on several inputs.

Decision Tree Overview
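To make the goal concrete, here is a minimal sketch of the smallest possible decision tree: a single split (a "stump") chosen by Gini impurity. It is an assumption-laden illustration, not the rpart, Weka or Datameer implementation. The function names are hypothetical, the four sample rows are made-up toy measurements, and the code assumes numeric inputs with at least two distinct values in some column.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


def fit_stump(rows, labels):
    """Fit a one-level tree: one split, with a majority-class prediction on each side."""
    feature, threshold, _ = best_split(rows, labels)
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": Counter(left).most_common(1)[0][0],
        "right": Counter(right).most_common(1)[0][0],
    }


def predict(stump, row):
    """Predict the target for one row of inputs."""
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


# Toy inputs (two numeric columns) and a target column; values are made up.
inputs = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5)]
target = ["setosa", "setosa", "versicolor", "versicolor"]

stump = fit_stump(inputs, target)
print(stump)
print(predict(stump, (1.5, 0.3)))
```

A full decision tree repeats the same split search recursively on each side until the groups are pure enough or a depth limit is reached.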

Page 28: Top 3 Considerations for Machine Learning on Big Data

Ease of Use

In Datameer, you select the columns... And get the results

First, you need to code...

    install.packages("rpart")   # one-time install of the rpart package
    library(rpart)
    treeInput <- read.csv("/PathToData/iris.csv")
    # fit a tree that predicts class from the four measurements
    fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data = treeInput)
    par(mfrow = c(1, 2), xpd = NA)
    plot(fit)                   # draw the tree
    text(fit, use.n = TRUE)     # label nodes with observation counts

And then run the results...
And then write more code to scale...

Page 29: Top 3 Considerations for Machine Learning on Big Data

Ease of Use

In Datameer, you select the columns... And get the results

First, select the data...
Second, you configure the settings...
And then see the results

Page 30: Top 3 Considerations for Machine Learning on Big Data

Quality Comparison

Tool | Iris | Wine | Breast Cancer Wisconsin
R | 92.66% | 86.47% | 92.86%
Weka | 95.33% | 89.33% | 93.5%
Datameer | 93.33% | 91.18% | 93.04%

Page 31: Top 3 Considerations for Machine Learning on Big Data

Recommendations

Page 32: Top 3 Considerations for Machine Learning on Big Data

Recommendations Overview

• Increased revenue
• Your customers expect them
• What makes a good recommendation?
• The combination of algorithms and Hadoop makes an effective recommendations platform achievable

Page 33: Top 3 Considerations for Machine Learning on Big Data

Ease of Use

In Datameer, you select the columns... And get the results

First, the set up...

    # run factorization of ratings matrix
    $MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
        --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2

    # compute recommendations
    $MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
        --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
        --numRecommendations 6 --maxRating 5 --numThreads 2

And then run the results...

    1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
    2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
    3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
    4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
    5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
    6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
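The recommendation output shown on this slide is one line per user, in the form `user [item:score,item:score,...]`. The sketch below shows one way to read such a line into usable values. The function name is hypothetical, and the format is assumed from the sample lines on the slide rather than from any Mahout specification.

```python
def parse_recommendations(line):
    """Parse 'user [item:score,item:score,...]' into (user_id, [(item_id, score), ...])."""
    user_part, items_part = line.strip().split(None, 1)
    items_part = items_part.strip()
    if not (items_part.startswith("[") and items_part.endswith("]")):
        raise ValueError(f"unexpected recommendation format: {line!r}")
    pairs = []
    for entry in items_part[1:-1].split(","):
        if not entry:
            continue  # tolerate an empty list such as "7 []"
        item_id, score = entry.split(":")
        pairs.append((int(item_id), float(score)))
    return int(user_part), pairs


# First three entries of the slide's line for user 3.
user, recs = parse_recommendations("3 [137:5.0,284:5.0,508:4.832]")
print(user, recs)
```

Once parsed, the item IDs still have to be joined back to item names before the results mean anything to a business user.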

Page 34: Top 3 Considerations for Machine Learning on Big Data

Quality Comparison

User | Shawshank | Godfather | Pulp Fiction | Fight Club
Dianna | 4.76 | 4.98 | 1.95 | 2.44
Jon | 1.99 | 2.51 | 2.87 | 4.83
Karen | 3.28 | 4.72 | 1.89 | 2.95
Elliott | 2.92 | 3.64 | 2.97 | 4.83

Same Results

Page 35: Top 3 Considerations for Machine Learning on Big Data

Best Practices

Page 36: Top 3 Considerations for Machine Learning on Big Data

Big Data Analytics Process

• Integrate
• Prepare and Analyze
• Visualize
• Define
• Deploy
• Ad Hoc
• Production

Page 37: Top 3 Considerations for Machine Learning on Big Data

• Leverage Hierarchies

• If possible, use numbering schemes

• Scale the surrogate key of attributes

• Try different cluster sizes

• Avoid numeric similarities when building your data

Clustering
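One possible reading of the hierarchy, numbering-scheme and scaling tips is sketched below: when attributes are turned into numeric keys for clustering, number them so that values under the same parent are numerically close, and multiply a column by a weight to control how much it influences the distance. This is an interpretation of the bullets, not something the deck spells out. The function names, the regions, and the states other than New Jersey and California are hypothetical.

```python
def hierarchical_codes(hierarchy, block_size=100):
    """Give each child a code inside its parent's numeric block (101, 102, 201, ...)."""
    codes = {}
    for parent_index, (parent, children) in enumerate(hierarchy.items(), start=1):
        if len(children) >= block_size:
            raise ValueError(f"{parent!r} has too many children for block_size={block_size}")
        for child_index, child in enumerate(children, start=1):
            codes[child] = parent_index * block_size + child_index
    return codes


def scale_column(values, weight):
    """Scale a numeric key column so it counts more or less in distance calculations."""
    return [value * weight for value in values]


# Hypothetical location hierarchy: region -> states.
locations = {
    "Northeast": ["New Jersey", "New York"],
    "West": ["California", "Oregon"],
}
codes = hierarchical_codes(locations)
print(codes)

# States in the same region end up closer together than states in different regions.
print(abs(codes["New Jersey"] - codes["New York"]))
print(abs(codes["New Jersey"] - codes["California"]))
```

With arbitrary sequential numbering, New Jersey and California could end up next to each other numerically, which is the kind of accidental numeric similarity the last bullet warns about.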

Page 38: Top 3 Considerations for Machine Learning on Big Data

• Leverage a combination of algorithms

• Clustering is your friend!

• Treat cold start situations differently

• Think about ranking

• Don’t let recommendations go wild

[Diagram labels: Item-Based; K-Means: Similar; Item Similarity; Best Recommendations]

Recommendations
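The "treat cold start situations differently" tip can be illustrated with a small fallback rule: users with history get their personalized list, and users without history get the most popular items instead. This is a sketch under stated assumptions, not the approach used by any product in this deck. The function name, the data layout and the popularity fallback are all choices made for the example.

```python
from collections import Counter


def recommend(user_id, history, personalized, top_n=3):
    """Return up to top_n items, falling back to popularity for cold-start users."""
    seen = history.get(user_id, [])
    if seen and user_id in personalized:
        # Known user: use the precomputed personalized ranking.
        candidates = personalized[user_id]
    else:
        # Cold start: rank items by how many interactions they have overall.
        counts = Counter(item for items in history.values() for item in items)
        candidates = [item for item, _ in counts.most_common()]
    # Never recommend something the user already has; cap the list length.
    return [item for item in candidates if item not in seen][:top_n]


# Hypothetical interaction history and personalized rankings.
history = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["a", "b"]}
personalized = {"u1": ["c", "a", "d"]}

print(recommend("u1", history, personalized))
print(recommend("u9", history, personalized))
```

Capping the list with `top_n` and filtering out items the user already has are two simple guards against recommendations "going wild".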

Page 39: Top 3 Considerations for Machine Learning on Big Data

Process Best Practices

• Iterate
• Map
• Chain

Page 40: Top 3 Considerations for Machine Learning on Big Data

Demonstration

Page 41: Top 3 Considerations for Machine Learning on Big Data

Polling Question II

Page 42: Top 3 Considerations for Machine Learning on Big Data

Return on Investment

Use Case | Result
Funnel Optimization | Increase Customer conversion by 3x
Behavioral Analytics | Increase Revenue by 2x
Fraud Prevention | Identify $2B in potential fraud
EDW Optimization | 98% OpEx savings, $1M+ CapEx savings
Customer Segmentation | Lower Customer Acquisition Costs by 30%

Page 43: Top 3 Considerations for Machine Learning on Big Data

Call to Action

Workshop

Contact
• Elliott Cordo [email protected]
• Karen Hsu [email protected]