Data Science at Scale: Spark – Zeppelin – ML

Kirk Haslbeck, Sr. Solution Engineer, HWX

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Kirk Haslbeck - Hortonworks

Sr. Solution Engineer @ Hortonworks

Lead Architect for Trade Surveillance @ Morgan Stanley

Masters in Data Mining @UMBC

Computer Science Degree @ Mount Saint Mary’s University

github.com/kirkhas/zeppelin-notebooks

Spark – Apache Open Source Project

Why do we need Spark?

Distributed
– Multi-threading is hard to do in Java, and even if you get it right it isn't distributed. It is limited to a single JVM.

Horizontal
– Spark can take advantage of a modern data architecture. It scales out as a function of hardware.

Data Science
– R and Python are both growing in popularity and are great for statistical workloads, but suffer from their single-threaded nature.

Need for a top-level computing language
– SQL is great and provides a lot of what we need, but not everything. Tradeoffs occur: SQL is better for some operations and a full programming language for others. Spark satisfies both!

Spark API Languages

Spark - Functional + Distributed = Concise and Powerful

Spark Map Function vs. Java Thread Pool

Objective: we have a list of tasks and we want to pad each project timeline with a 20% time buffer.
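
The objective above can be sketched with a plain map over a list. This is a minimal illustration, not the code from the original slide: the task names, day counts, and the `pad` helper are all assumptions. In Spark the same per-element function would be handed to a distributed map; Python's built-in `map` stands in for it here.

```python
# Hypothetical task list: (task name, estimated days). Values are illustrative only.
tasks = [("design", 10.0), ("build", 25.0), ("test", 15.0)]

BUFFER = 0.20  # the 20% time buffer from the objective


def pad(task):
    """Return a new (name, days) pair with the buffer applied; the input is left untouched."""
    name, days = task
    return (name, days * (1 + BUFFER))


# One pure function applied to every element; no threads or shared state to manage.
padded = list(map(pad, tasks))

for name, days in padded:
    print(f"{name}: {days:.1f} days")
```

Because `pad` has no side effects, each element can be processed independently, which is what lets a framework spread the work across machines.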

Why Spark chose Scala?

Functional
– Map, Filter, Fold, GroupBy
– 5-10X code reduction

Immutable
– No state management, less headache; each operation is fully encapsulated.

Thread Safety is the Biggest Challenge
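
The four functional operations named above have close stdlib analogues in Python. The sketch below uses them on immutable tuples to show the style; the trade data is made up, and this is an analogy to the Scala collection operations rather than Spark code.

```python
from functools import reduce
from itertools import groupby

# Illustrative trades as immutable tuples: (symbol, price). Values are made up.
trades = (("AAPL", 100.0), ("MSFT", 50.0), ("AAPL", 110.0), ("MSFT", 55.0))

# map: derive a new value from each element without touching the original
prices = tuple(map(lambda t: t[1], trades))

# filter: keep only the elements that satisfy a predicate
expensive = tuple(filter(lambda t: t[1] > 60.0, trades))

# fold: combine all elements into one value, starting from an initial accumulator
total = reduce(lambda acc, p: acc + p, prices, 0.0)

# groupBy: bucket elements by key (itertools.groupby requires input sorted by that key)
by_symbol = {
    sym: tuple(price for _, price in group)
    for sym, group in groupby(sorted(trades, key=lambda t: t[0]), key=lambda t: t[0])
}

print(prices, expensive, total, by_symbol)
```

Every step returns a new value and leaves `trades` unchanged, so there is no shared mutable state for threads to contend over.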

RDDs, DataFrames and DataSets

Resilient Distributed Dataset (RDD)
– Good for schema
– case class Trade(sym: String, price: Double)

DataFrame
– SQL-like operations, higher-level object
– Aggregations, ordering

Interoperability
– Finally, interop between Tables, Classes, and Vectors for Data Science. Borrowing the best from R, Scala and SQL. Impedance mismatch solved; no need for a Domain Layer or Data Access Layer.

RDD (low level) vs. DataFrames (new API)

Spark 101 – Execution Model

Spark Driver
– Client-side application that creates the Spark Context

Spark Context
– Talks to the Spark Driver and Cluster Manager to launch Spark Executors

Cluster Manager
– E.g. YARN, Spark Standalone, Mesos

Executors
– Spark worker bees

Spark Engine in the HDP Stack

Spark is a first-class citizen of Hadoop

Demo

Show me the Code!

Model Inputs

Data Gathering

Custom Logic

Process Flow

Evaluate Results

What About Machine Learning?

Machine Learning and Big Data

Machine learning has advanced to the point where it more or less goes hand-in-hand with Big Data. Indeed, so popular is the technology that over a third of developers – some 36 percent – who are working on Big Data or advanced analytics projects use elements of machine learning, says a new study by Evans Data Corp.

Machine Learning involves creating and improving complex algorithms that are able to analyze data automatically and identify patterns or predict outcomes based on the knowledge they have “learned”. As such, it has great potential for helping companies to better understand what their data is telling them.

Where Can We Use Data Science?

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Customer Use Cases with Spark

Web Analytics (WebTrends): Web analytics for marketing
• Ingesting 13 billion events/day
• Use Spark Streaming & Samza for data ingest
• Extremely low latency: 40 milliseconds
• Need more metrics for Spark Streaming
• Wants 2-way SSL for Kafka Spark receiver

Bank/Credit Card: Real-time monitoring and fraud detection
• Monitor ATM with NiFi
• Start with log aggregation
• Tackle fraud detection next

Railroad Company: Real-time view of state of track
• Optimize the train maintenance
• Large volume of track data, down to feel granularity
• GeoSpatial analytics is critical

Cable Company: Optimize advertising
• Monitor channel changes with Spark Streaming
• Correlate changes with Ads/Programming
• Allocate ads in real time: show ads to users who are watching a show and will stay for over 20 seconds
• How to optimize Spark App development

Example: Credit Card Fraud Detection

Building a Model

Show of hands, how many have built a “Model”?

What are some limitations?
– Conditional based logic: if/else binary decisions

If you need a lot of data to build a good model, what tools can you use?
– Data volumes can eliminate the possibility of desktop tools

Sampling?
– Well… we had better get an even distribution of true and false positives in each sample, but wait, that requires data munging, which brings us back to what tools we can use.

Security Concerns?
– Extracting data from its secure resting place and pushing it into other environments, oftentimes unsecured files or desktops where Matlab or R can be installed.

Collaboration
– Push processing to the data using modern distributed tooling.

“All models are wrong, some are useful”

George E. P. Box

The most limiting factor is the data. With modern systems we are now able to capture more data and hopefully produce better insights.

Credit Card Fraud

Requirement: Detect fraudulent transactions.

Goal: Save the card company money and build trust amongst card users. Cut down on fraudulent crime.

Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.

Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder?

Ideas?
– White board some conditional logic, egregiousness vs binary
– Back test the data
– Build a model per card holder?
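
One way to whiteboard the Distance question is a speed check between two consecutive purchases on the same card. The sketch below is an assumption-laden illustration, not the demo's logic: the 900 km/h threshold, the `(lat, lon, epoch_seconds)` record shape, and both function names are hypothetical. Distance is computed with the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_travel(prev, curr, max_speed_kmh=MAX_SPEED_KMH):
    """prev and curr are hypothetical (lat, lon, epoch_seconds) purchase records."""
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600.0
    if hours <= 0:
        # Same instant (or out-of-order timestamps): any movement at all is suspicious.
        return dist > 0
    return dist / hours > max_speed_kmh


new_york = (40.7128, -74.0060, 0)
los_angeles_1h = (34.0522, -118.2437, 3600)       # one hour later
los_angeles_10h = (34.0522, -118.2437, 10 * 3600)  # ten hours later

print(implausible_travel(new_york, los_angeles_1h))
print(implausible_travel(new_york, los_angeles_10h))
```

A binary flag like this is the "conditional logic" end of the spectrum; returning the implied speed instead would give a graded measure of egregiousness.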

Rules, Statistics, Machine Learning

Rule Based Logic
– Great for checking conditions that can prove to be 100% accurate. Easy to build and no reason to over-engineer.
– Example: Spending Limit. Card holder limit = $2,000
• If (currentPurchaseAmount + balance > 2,000) then deny transaction
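
The spending-limit rule above translates directly into a one-line predicate. The function name and default argument are illustrative choices; the $2,000 limit and the comparison come from the slide.

```python
CARD_LIMIT = 2000.00  # card holder limit from the example


def approve(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Return True to approve the transaction, False to deny it.

    Denies exactly when current_purchase_amount + balance > limit.
    """
    return current_purchase_amount + balance <= limit


print(approve(500.00, 1200.00))  # within the limit
print(approve(900.00, 1200.00))  # would exceed the limit
```

A rule like this is deterministic and needs no training data, which is why there is no reason to reach for a model here.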

Statistics
– Mean, median, mode, variance, deviation
– Anomaly detection. Outliers. (e.g. women's retail example)
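
As a minimal sketch of statistical anomaly detection, a new purchase can be scored by how many standard deviations it sits from the card holder's historical mean (a z-score). The threshold of 3 and the function name are assumptions, and the four-purchase history is far too small for real use.

```python
import statistics


def is_outlier(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` sample standard deviations from the mean of `history`."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation; needs at least two points
    if stdev == 0:
        # All past purchases were identical, so any different amount is unusual.
        return amount != mean
    return abs(amount - mean) / stdev > threshold


# Illustrative purchase history in dollars
history = [45.00, 8.50, 88.00, 55.00]

print(is_outlier(history, 60.00))
print(is_outlier(history, 900.00))
```

A z-score assumes the amounts are roughly bell-shaped; for skewed spending data a median-based measure may behave better.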

Machine Learning
– Supervised
– Unsupervised
– Trainable
– Adapt over time

Discovery

Gathered all Credit Card Transactions
– Problem is they didn’t make sense
– No identifiable patterns, no log normal curves
– Gas $45, Chipotle $8.50, Steak dinner $88, Amazon shoes $55

Classification

Outlier Detection: identify abnormal patterns

Example: identify anomalies

Features:
- Time frequency
- Category
- Amount
- Distance

Hortonworks Data Flow

Machine Learning Continued

Classification: predicting a category

Some techniques:
- Naïve Bayes
- Decision Tree
- Logistic Regression
- SGD
- Support Vector Machines
- Neural Network
- Ensembles

Regression: predict a continuous value

Some techniques:
- Linear Regression / GLM
- Decision Trees
- Support Vector Regression
- SGD
- Ensembles

Unsupervised Learning: detect natural patterns

Age   State   Annual Income   Marital status
25    CA      $80,000         M
45    NY      $150,000        D
55    WA      $100,500        M
18    TX      $85,000         S
…     …       …               …

No labels

Model

Naturally occurring (hidden) structure

Clustering: detect similar instance groupings

Some techniques:
- k-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
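
To make the first technique concrete, here is a small pure-Python sketch of k-means (Lloyd's algorithm). It is illustrative only: the points are (age, income in $ thousands) pairs, the starting centroids are chosen by hand, and the features are left unscaled, which a real pipeline would normally correct.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm. `points` and `centroids` are lists of equal-length numeric tuples."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    # `clusters` holds the assignment made during the final iteration.
    return centroids, clusters


# Illustrative unlabeled rows: (age, annual income in $ thousands)
points = [(25, 80.0), (45, 150.0), (55, 100.5), (18, 85.0)]
initial = [(25, 80.0), (45, 150.0)]  # hand-picked starting centroids (an assumption)

centroids, clusters = kmeans(points, initial)
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the grouping emerges from distances alone, which is the sense in which the structure is "naturally occurring".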

Getting the Proper Fit

Over-fitting: Model over-fits the training set, but does not generalize well to new inputs

Under-fitting: Model can’t predict accurately

Business Intelligence vs Data Science

R and Matplotlib now available

R and Matplotlib Visuals in Zeppelin

Matplotlib with Python

Appendix – Links to content

Github https://github.com/kirkhas/zeppelin-notebooks

Credit Card Fraud (real-time ML) https://community.hortonworks.com/articles/38457/credit-fraud-prevention-demo-a-guided-tour.html

Monte Carlo / VaR https://community.hortonworks.com/articles/39096/predicting-stock-portfolio-gains-using-monte-carlo.html

Stock Variance https://community.hortonworks.com/repos/32713/stock-variance-using-zeppelin.html
