Spark-Zeppelin-ML on HWX

Data Science at Scale: Spark – Zeppelin – ML

Kirk Haslbeck, Sr. Solution Engineer HWX

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Kirk Haslbeck - Hortonworks

Sr. Solution Engineer @ Hortonworks

Lead Architect for Trade Surveillance @ Morgan Stanley

Masters in Data Mining @UMBC

Computer Science Degree @ Mount Saint Mary’s University

github.com/kirkhas/zeppelin-notebooks



Spark – Apache Open Source Project


Why do we need Spark?

Distributed – Multi-threading is hard to do in Java, and even if you get it right it isn’t distributed; it is limited to a single JVM.

Horizontal – Spark can take advantage of a modern data architecture and scales out as a function of hardware.

Data Science – R and Python are both growing in popularity and are great for statistical workloads, but suffer from their single-threaded nature.

Need for a top-level computing language – SQL is great and provides a lot of what we need, but not everything. There are tradeoffs: SQL is better for some operations and a full programming language for others. Spark satisfies both!


Spark API Languages


Spark - Functional + Distributed = Concise and Powerful

Spark Map Function vs. Java Thread Pool

Objective: we have a list of tasks and we want to pad each project timeline with a 20% time buffer
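The objective above can be sketched with a plain functional map. This is a minimal illustration in ordinary Python rather than the Spark or Java code shown on the slide; the task names and day counts are hypothetical.

```python
# Hypothetical task list: (task name, estimated days). Not from the original slides.
tasks = [("design", 10.0), ("build", 20.0), ("test", 5.0)]

BUFFER = 0.20  # the 20% time buffer from the objective


def pad(task):
    """Return a new (name, days) pair with the buffer applied; the input is not mutated."""
    name, days = task
    return (name, days * (1 + BUFFER))


# map applies pad to every element independently. Because no element depends on
# another and nothing is mutated, an engine such as Spark can split the same
# work across many machines without locks or a thread pool.
padded = list(map(pad, tasks))
print(padded)
```

The same shape carries over to a distributed collection: the function stays the same and only the collection it is mapped over changes.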


Why did Spark choose Scala?

Functional – Map, Filter, Fold, GroupBy – 5-10X code reduction

Immutable – No state management, less headache; each operation is fully encapsulated.

Thread Safety is the Biggest Challenge


RDDs, DataFrames and DataSets

Resilient Distributed Dataset – Good for schema – case class Trade(sym: String, price: Double)

DataFrame – SQL-like operations, higher-level object – aggregations, ordering

Interoperability – Finally, interop between Tables, Classes, and Vectors for Data Science, borrowing the best from R, Scala and SQL. Impedance mismatch solved: no need for a Domain Layer or a Data Access Layer.


RDD (low level) vs. DataFrames (new API)


Spark 101 – Execution Model

Spark Driver – Client side application that creates Spark Context

Spark Context – Talks to Spark Driver, Cluster Manager to Launch Spark Executors

Cluster Manager – E.g. YARN, Spark Standalone, Mesos

Executors – Spark worker bees


Spark Engine in the HDP Stack

Spark is a first-class citizen of Hadoop


Demo

Show me the Code!


Model Inputs

Data Gathering

Custom Logic

Process Flow

Evaluate Results


What About Machine Learning?


Machine Learning and Big Data

Machine learning has advanced to the point where it more or less goes hand-in-hand with Big Data. Indeed, so popular is the technology that over a third of developers – some 36 percent – who are working on Big Data or advanced analytics projects use elements of machine learning, says a new study by Evans Data Corp.

Machine Learning involves creating and improving complex algorithms that are able to analyze data automatically and identify patterns or predict outcomes based on the knowledge they have “learned”. As such, it has great potential for helping companies to better understand what their data is telling them.


Where Can We Use Data Science?

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels


Customer Use Cases with Spark

Web Analytics - WebTrends: Web Analytics for Marketing
• Ingesting 13 billion events/day
• Use Spark Streaming & Samza for data ingest
• Extremely low latency: 40 milliseconds
• Needs more metrics for Spark Streaming
• Wants 2-way SSL for the Kafka Spark receiver

Bank/Credit Card: Real time monitoring and Fraud Detection
• Monitor ATM with NiFi
• Start with log aggregation
• Tackle fraud detection next

Railroad Company: Real time view of state of track
• Optimize the train maintenance
• Large volume of track data, down to feet granularity
• GeoSpatial analytics is critical

Cable Company: Optimize Advertising
• Monitor channel changes with Spark Streaming
• Correlate changes with Ads/Programming
• Allocate Ads real time: show ads to users who are watching a show and will stay for over 20 seconds
• How to optimize Spark App development


Example: Credit Card Fraud Detection


Building a Model

Show of hands, how many have built a “Model”?

What are some limitations?
– Conditional based logic: if/else binary decisions

If you need a lot of data to build a good model, what tools can you use?
– Data volumes can eliminate the possibility of desktop tools

Sampling?
– Well… we had better get an even distribution of true and false positives in each sample, but wait, that requires data munging, so we are back to which tools we can use.

Security Concerns?
– Extracting data from its secure resting place and pushing it into other environments, often unsecured files or desktops where Matlab or R can be installed.

Collaboration
– Push processing to the data using modern distributed tooling.


“All models are wrong, some are useful”

George E. P. Box

The most limiting factor is the data. With modern systems we are now able to capture more data and hopefully produce better insights.


Credit Card Fraud

Requirement: Detect fraudulent transactions.

Goal: Save the card company money and build trust amongst card users. Cut down on fraudulent crime.

Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.

Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder?

Ideas?
– White board some conditional logic, egregiousness vs binary
– Back test the data
– Build a model per card holder?
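The Distance question above can be sketched as an implied-speed check between two consecutive purchases. This is an illustrative sketch, not the logic used in the demo: the function names, the 900 km/h ceiling, and the sample coordinates are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed ceiling (roughly airliner speed); tune against real data


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(prev, curr):
    """Speed needed to travel between two purchases, each given as (lat, lon, unix_seconds)."""
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = max(curr[2] - prev[2], 1) / 3600.0  # clamp to 1 second to avoid dividing by zero
    return dist / hours


def is_suspicious(prev, curr, max_speed=MAX_SPEED_KMH):
    """Flag the purchase when the card would have had to move faster than max_speed."""
    return implied_speed_kmh(prev, curr) > max_speed


# Hypothetical purchases: Baltimore, then Los Angeles 30 minutes later.
baltimore = (39.29, -76.61, 0)
los_angeles = (34.05, -118.24, 30 * 60)
print(is_suspicious(baltimore, los_angeles))
```

Returning the implied speed itself, instead of only a yes/no flag, is one way to capture "egregiousness vs binary": a downstream model can weigh how far over the ceiling a purchase is.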


Rules, Statistics, Machine Learning

Rule Based Logic
– Great for checking conditions that can prove to be 100% accurate. Easy to build and no reason to over-engineer.
– Example: Spending Limit. Card holder limit = $2,000
• If (currentPurchaseAmount + balance > 2,000) then deny transaction

Statistics
– Mean, median, mode, variance, deviation
– Anomaly detection. Outliers. (e.g. women's retail example)

Machine Learning
– Supervised
– Unsupervised
– Trainable
– Adapt over time
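The spending-limit rule above translates directly into code. A minimal sketch, assuming amounts are plain numbers in dollars; the function name `approve` is hypothetical.

```python
CARD_LIMIT = 2000.00  # card holder limit from the example


def approve(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Return False (deny) when the purchase would push the balance over the limit."""
    return current_purchase_amount + balance <= limit


print(approve(500.00, 1000.00))  # within the limit
print(approve(500.00, 1800.00))  # would exceed the limit
```

A rule like this is deterministic and cheap to evaluate, which is why it suits conditions that are always true or always false; it says nothing about purchases that are merely unusual.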


Discovery

Gathered all Credit Card Transactions
– Problem is they didn’t make sense
– No identifiable patterns, no log normal curves
– Gas $45, Chipotle $8.50, Steak dinner $88, Amazon shoes $55

Classification


Outlier Detection: identify abnormal patterns

Example: identify anomalies

Features:
- Time frequency
- Category
- Amount
- Distance
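For the Amount feature, one simple statistical approach is a z-score against the card holder's purchase history. This is a sketch under assumptions: the threshold of 3 standard deviations is a common convention rather than a value from the slides, and a z-score presumes roughly bell-shaped data, which real transaction amounts may not be.

```python
import statistics


def z_score(amount, history):
    """Number of sample standard deviations that amount lies from the history's mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # needs at least two purchases that are not all equal
    return (amount - mean) / stdev


def is_outlier(amount, history, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(z_score(amount, history)) > threshold


# Purchase amounts taken from the Discovery slide: gas, Chipotle, steak dinner, shoes.
history = [45.00, 8.50, 88.00, 55.00]

print(is_outlier(60.00, history))    # an ordinary-looking purchase
print(is_outlier(1500.00, history))  # far outside this card holder's usual range
```

The same pattern extends to the other features by computing a score per feature and combining them, which is where a trained model starts to earn its keep over hand-set thresholds.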


Hortonworks Data Flow




Machine Learning Continued


Classification: predicting a category

Some techniques:
- Naïve Bayes
- Decision Tree
- Logistic Regression
- SGD
- Support Vector Machines
- Neural Network
- Ensembles


Regression: predict a continuous value

Some techniques:
- Linear Regression / GLM
- Decision Trees
- Support vector regression
- SGD
- Ensembles
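As a small illustration of the first technique in the list, ordinary least squares for a single input has a closed-form solution. The sketch below uses made-up data points and a hypothetical `fit_line` helper; a real workload would use a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the products of x and y deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Predict a continuous value for an unseen input.
print(slope * 10.0 + intercept)
```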


Unsupervised Learning: detect natural patterns

Age   State   Annual Income   Marital status
25    CA      $80,000         M
45    NY      $150,000        D
55    WA      $100,500        M
18    TX      $85,000         S
…     …       …               …

No labels

Model: naturally occurring (hidden) structure


Clustering: detect similar instance groupings

Some techniques:
- k-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
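To make the first technique concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The points and the starting centroids are hypothetical, and the fixed starting centroids are a simplification: production implementations choose them randomly or with k-means++.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups, with no labels attached.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With mixed-scale features such as age and annual income, the inputs would normally be standardized first so that the larger-valued column does not dominate the distance.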


Getting the Proper Fit

Over-fitting: Model fits the training set too closely and does not generalize well to new inputs

Under-fitting: Model is too simple to capture the pattern, so it can’t predict accurately even on the training set
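Over-fitting can be seen by comparing error on the training set with error on held-out data. The sketch below is an illustration on made-up numbers: a nearest-neighbour "memorizer" reproduces the noisy training points perfectly, while a fitted straight line does not, yet the line does better on unseen inputs. All names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbour(train_x, train_y):
    """Model that memorizes the training set and returns the label of the closest point."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function over a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training data: y = 2x with alternating +1 / -1 noise added.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
# Held-out data: noise-free values of y = 2x at inputs the models have not seen.
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]

memorizer = nearest_neighbour(train_x, train_y)
slope, intercept = fit_line(train_x, train_y)


def line(x):
    return slope * x + intercept


print("memorizer train/test MSE:", mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y))
print("line      train/test MSE:", mse(line, train_x, train_y), mse(line, test_x, test_y))
```

A large gap between training error and held-out error is the usual warning sign of over-fitting; high error on both suggests under-fitting.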


Business Intelligence vs. Data Science

R and Matplotlib now available


R and Matplotlib Visuals in Zeppelin


Matplotlib with Python


Appendix – Links to content

GitHub: https://github.com/kirkhas/zeppelin-notebooks

Credit Card Fraud (real-time ML): https://community.hortonworks.com/articles/38457/credit-fraud-prevention-demo-a-guided-tour.html

Monte Carlo / VaR: https://community.hortonworks.com/articles/39096/predicting-stock-portfolio-gains-using-monte-carlo.html

Stock Variance: https://community.hortonworks.com/repos/32713/stock-variance-using-zeppelin.html

