Machine Learning with Apache Spark

CodeMash, Sandusky, Ohio, Jan 5-8, 2016

David Taieb

STSM, IBM Cloud Data Services

©2015 IBM Corporation

Introduction

David Taieb

Developer Advocate, IBM Cloud Data Services

Our mission: We are here to help developers realize their most ambitious projects.

https://developer.ibm.com/clouddataservices/connect/


Big data, cloud and the rise of business Analytics

‣ Data collected by enterprises is growing exponentially: ERP, embedded systems (IoT)

‣ Cloud, with high availability and huge capacity, makes more data available for analytics

‣ Big data and cloud create new opportunities:
- Organizations: more effective decision-making process, richer client interactions
- Business users: discover new insights, better decision-making process
- Developers: access to diverse data sources and new tools that increase productivity


Why Business Analytics with big data

“In God we trust. All others bring data”

W. Edwards Deming

‣Every day, companies make bet-the-business decisions about their customers, competitors and new products

‣Time available for decision-making is shrinking (sometimes real-time)

‣As more and more companies go digital, data becomes the world’s newest resource for competitive advantage

‣Decision making has moved from the elite few to the empowered many

‣Few organizations can keep pace with the appetite for data

Business Analytics Types

Descriptive Analytics: Look at the reason for past success or failure

• Use interactive querying and visualization to explore and communicate data
• Discover insights and trends
• Correlation between 2 seemingly unrelated variables
• Data mining
• Generate hypotheses and models

Predictive Analytics: What is probably going to happen in the future?

• Predict occurrence of future events using probability (confidence)
• Product recommendations
• Classification

Prescriptive Analytics: What are my best actions?

• Help make the right decision based on the data
• Find the optimal solution to a given problem

Taking Analytics a step further with Cognitive Systems

‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data

Decide
• Ingest and analyze domain sources, info models
• Generate evidence based decisions with confidence
• Learn with new outcomes and actions
• e.g. Next generation Apps: Probabilistic Apps

Ask
• Leverage vast amounts of data
• Ask questions for greater insights
• Natural language inquiries
• e.g. Next generation Chat

Discover
• Find the rationale for given answers
• Prompt for inputs to yield improved responses
• Inspire considerations of new ideas
• e.g. Next generation Search: Discovery

IBM Watson

IBM Cloud Data Services

Resources for developers to get, build, and analyze on the IBM Cloud


What is Spark

Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes


Spark Core Libraries

Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions

Spark SQL: executes SQL statements

Spark Streaming: performs streaming analytics using micro-batches

MLlib (machine learning): common machine learning and statistical algorithms

GraphX (graph): distributed graph processing framework


Key reasons for interest in Spark

Open Source
• Largest project and one of the most active on Apache
• Vibrant, growing community of developers continuously improves the code base and extends capabilities
• Fast adoption in the enterprise (IBM, Databricks, etc…)

Fast
• In-memory storage greatly reduces disk I/O
• Up to 100x faster in memory, 10x faster on disk

Distributed data processing
• Fault tolerant: seamlessly recomputes data lost to hardware failure
• Scalable: easily increase the number of worker nodes
• Flexible job execution: Batch, Streaming, Interactive

Productive
• Unified programming model across a range of use cases
• Rich and expressive APIs hide the complexities of parallel computing and worker node management
• Support for Java, Scala, Python and R: less code written
• Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX

Web Scale
• Easily handles petabytes of data without special code handling
• Compatible with the existing Hadoop ecosystem


IBM is all-in on its commitment to Spark

Foster Community
• Educate 1M+ data scientists and engineers via online courses
• Sponsor AMPLab, creators and evangelists of Spark

Infuse the Portfolio
• Integrate Spark throughout portfolio
• 3,500 employees working on Spark-related topics
• Spark however customers want it: standalone, platform or products

Contribute to the Core
• Launch Spark Technology Cluster (STC), 300 engineers
• Open source SystemML
• Partner with Databricks

Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss


Spark MLlib

‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms

‣ Highly scalable: leverages Spark's ability to work with massive amounts of data

‣ Fast: designed for parallel computing

‣ Covers common machine learning algorithms:
- Regression
- Classification
- Clustering
- Recommender Systems
- Text Analytics


What is Machine Learning and where is it used

‣ Subfield of computer science that focuses on getting computers to learn from data:
- Recognize patterns
- Make predictions

‣ Example use:
- Spam filters
- Netflix recommendations
- Self-driving cars
- Watson
- …


Typical Machine Learning Flow diagram

Stages: Data Acquisition → Data Preparation (Cleansing, Shaping, Enrichment) → Data Annotation (Ground Truth) → Model Training → Model Testing

Data sets: Training Set, Test Set, Blind Set

Iterative steps: Train Model, Cross-Validation, Evaluate Performance and optimize model
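The three data sets named in the flow can be produced by shuffling the annotated records once and cutting them into disjoint slices. The following is a minimal sketch in plain Python, not the code used in the demo: the function name `split_dataset`, the 60/20/20 fractions and the seed are all assumptions chosen for illustration. On a Spark RDD the same idea is available through `randomSplit`.

```python
import random

def split_dataset(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle records and cut them into training, test and blind sets.

    The fractions and the seed are illustrative assumptions, not values
    taken from the presentation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * fractions[0])
    test_end = train_end + round(len(shuffled) * fractions[1])
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]

# Stand-in records; real records would be annotated flight observations.
records = list(range(100))
training_set, test_set, blind_set = split_dataset(records)
print(len(training_set), len(test_set), len(blind_set))
```

Keeping the blind set out of every training and tuning iteration is what makes the final performance measurement trustworthy.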


MLlib Algorithm Overview

• Predictive analytics

• Recommendations
- Collaborative Filtering
- Matrix Factorization

• Feature extraction and Transformation
- TF-IDF
- HashingTF
- Word2Vec
- StandardScaler
- Normalizer

• Model Evaluation/Metrics
- Binary Classification Metrics
- Multi Class Metrics
- Regression Metrics


Predictive analytics

Supervised Learning (requires Ground-Truth)

Continuous Output:
• Regression: Linear, Ridge, Lasso, Isotonic
• Decision Tree
• RandomForest
• GradientBoostedTree

Discrete Output:
• Classification: Logistic Regression, SVM, NaiveBayes
• Decision Tree
• RandomForest
• GradientBoostedTree
• K-NN (available as add-on Spark package)

Unsupervised Learning (no Ground-Truth data required)

Continuous Output:
• Clustering: KMeans, Gaussian Mixture
• Dimensionality Reduction: PCA, SVD

Discrete Output:
• FP-Growth


Featured demo: Flight Delay Predictor

‣ Use training data collected from flight stats and enriched with weather observations from the “Insight for Weather” service on Bluemix

‣ Train a multi-class classifier that, given flight departure weather observations, can predict the flight delay class:
- 0 = Canceled
- 1 = On Time
- 2 = Delay less than 2 hours
- 3 = Delay between 2 and 4 hours
- 4 = Delay more than 4 hours

‣ Provide metrics for each algorithm:
- Accuracy
- Precision
- Recall
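The five delay classes can be derived from a flight's status and its delay in minutes. The sketch below is only an illustration of that labeling step: the function name `delay_class`, the `on_time_threshold` parameter and the choice to treat exactly 2 and exactly 4 hours as class 3 are assumptions, since the slides do not say how the boundaries are handled.

```python
from typing import Optional

# Class codes as listed on the slide.
CANCELED, ON_TIME, DELAY_UNDER_2H, DELAY_2H_TO_4H, DELAY_OVER_4H = range(5)

def delay_class(delay_minutes: Optional[float],
                canceled: bool = False,
                on_time_threshold: float = 0.0) -> int:
    """Map a flight to one of the five delay classes.

    on_time_threshold (minutes) is a hypothetical knob: delays at or
    below it count as "On Time".
    """
    if canceled:
        return CANCELED
    if delay_minutes is None:
        raise ValueError("delay_minutes is required for flights that were not canceled")
    if delay_minutes <= on_time_threshold:
        return ON_TIME
    if delay_minutes < 120:
        return DELAY_UNDER_2H
    if delay_minutes <= 240:  # assumption: both boundaries fall in class 3
        return DELAY_2H_TO_4H
    return DELAY_OVER_4H

print(delay_class(45))
print(delay_class(None, canceled=True))
```

These integer codes are what a multi-class classifier would be trained to predict as labels.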


Architecture

Diagram: a Simple Data Pipes custom connector, run every 24 hours, pulls data from Weather and Flightstats (Airports, Flight Schedules, Flight Status) into Cloudant as Metadata, Training Set, Test Set and Blind Set, which are then used from the Notebook.


Get

‣ Identify data sources:
- flightstats.com: https://developer.flightstats.com
  - Airport metadata: FS Code, geolocation,…
  - Flight Schedules
  - Flight Status
- Weather Observations
  - Insight for Weather on Bluemix

‣ Storage:
- Cloudant

‣ Tool used:
- Simple Data Pipes custom connector to build Training, Test and Blind data sets

‣ Constraints:
- Weather service provides past observations only as far back as 24 hours
- Flightstats API key is a 30-day trial version, limited to 20,000 calls


Custom Pipes Connector to build training data set

https://developer.ibm.com/clouddataservices/simple-data-pipe/


Run every 24 hours

Because the Weather service doesn’t return observations older than 24 hours, the connector that builds the data set must be run every 24 hours


Build: Explore the data with Notebook


Loading training data set


Build: Visualize and explore data set

Scatter plot of flight delays based on temperature at departure and arrival airports


Build: Visualize and explore data set

Scatter plot of flight delays based on wind speed at departure and arrival airports


Constraints

‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data:
- Limit the number of features used to train the models to the intersection of the two.

‣ Restrict the training data to weather forecasts at the departure and arrival airports
- Would adding weather data from various points along the route increase the model performance?

‣ Difficult to get enough representative data because I was using a trial account on flightstats
- Ideally, I would use more airports with more representative weather

‣ Didn’t use any categorical features

‣ For simplicity: use IPython Notebook as the user interface
- Makes the experience less compelling for business users
- To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library
- Doesn’t cover as much of the Spark API as Scala
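Restricting the features to those present in both past observations and forecasts amounts to a set intersection. The sketch below illustrates the idea only: every field name in it is hypothetical and is not taken from the Insight for Weather API.

```python
# Hypothetical field names, chosen for illustration only.
observation_fields = {"temperature", "dew_point", "humidity", "wind_speed",
                      "visibility", "pressure", "precipitation"}
forecast_fields = {"temperature", "dew_point", "humidity", "wind_speed",
                   "visibility", "chance_of_rain"}

# Only features available at both training time (observations) and
# prediction time (forecasts) can be used by the model.
usable_features = sorted(observation_fields & forecast_fields)

def to_feature_vector(record, features=usable_features):
    """Extract the usable features from a record, in a fixed order."""
    return [float(record[name]) for name in features]

observation = {"temperature": 12, "dew_point": 8, "humidity": 77, "wind_speed": 15,
               "visibility": 10, "pressure": 1013, "precipitation": 0}
print(usable_features)
print(to_feature_vector(observation))
```

Sorting the feature names fixes the column order, so vectors built from observations and from forecasts line up position by position.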


Load labeled data RDD


Build: NaiveBayes Classification


Build: Decision Tree classification


Build: Random Forest classification


Build: Performance measurements

Load blind data


Build: Compare metrics between different models
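Comparing models comes down to computing the same metrics from each model's (label, prediction) pairs on the blind set. The following plain-Python sketch shows how accuracy and per-class precision and recall are defined. The function name and the sample pairs are made up for illustration; in the notebook these values would more likely come from MLlib's `MulticlassMetrics`.

```python
from collections import Counter

def classification_metrics(labels_and_predictions):
    """Return (accuracy, precision per class, recall per class)."""
    pairs = list(labels_and_predictions)
    true_pos = Counter(label for label, pred in pairs if label == pred)
    predicted = Counter(pred for _, pred in pairs)
    actual = Counter(label for label, _ in pairs)
    classes = sorted(set(actual) | set(predicted))
    # Precision: of the flights predicted as class c, how many really were c.
    precision = {c: true_pos[c] / predicted[c] if predicted[c] else 0.0 for c in classes}
    # Recall: of the flights that really were class c, how many were found.
    recall = {c: true_pos[c] / actual[c] if actual[c] else 0.0 for c in classes}
    accuracy = sum(true_pos.values()) / len(pairs) if pairs else 0.0
    return accuracy, precision, recall

# Made-up (label, prediction) pairs using the delay classes 0 to 4.
pairs = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 1), (3, 3), (0, 0), (4, 3)]
accuracy, precision, recall = classification_metrics(pairs)
print(accuracy)
print(precision)
print(recall)
```

Per-class values matter here because the classes are unbalanced: a model that always answers "On Time" can score a high accuracy while having zero recall on every delay class.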


Naïve Bayes vs Decision Tree

Naïve Bayes

‣ Probabilistic: computes the probability that a data instance belongs to a specific class

‣ Assumes that each feature (variable) is independent of the others

‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)

‣ Works well with a small amount of training data. Doesn’t need all the possibilities

‣ Doesn’t work with categorical features.

Decision Tree

‣ Non-probabilistic: partitions the data into subsets that best describe the variable

‣ The deeper the tree, the better the model fits the data

‣ Watch out for overfitting: need to prune the tree

‣ Can handle categorical or continuous features

‣ No need for input to be scaled or standardized: set your features and go!

‣ Requires a lot of data covering all possibilities
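To make the independence assumption listed for Naïve Bayes concrete, here is a small from-scratch sketch. It uses a Gaussian variant, chosen because it is easy to read with continuous weather values; it is not the MLlib implementation, which uses a multinomial model by default. The two features (wind speed, visibility) and the training values are invented for illustration.

```python
import math
from collections import defaultdict

def train_gaussian_nb(samples):
    """samples: list of (label, [feature values]).

    Returns {label: (prior, means, variances)}.
    """
    by_class = defaultdict(list)
    for label, features in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small floor avoids division by zero for constant features.
        variances = [max(sum((x - m) ** 2 for x in col) / len(col), 1e-9)
                     for col, m in zip(columns, means)]
        model[label] = (len(rows) / len(samples), means, variances)
    return model

def predict(model, features):
    """Pick the class with the highest log-posterior."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: each feature contributes its own
        # likelihood term, and the terms are simply added in log space.
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training data: [wind speed, visibility]; 1 = On Time, 2 = Delayed.
samples = [(1, [5.0, 20.0]), (1, [7.0, 22.0]), (1, [6.0, 18.0]),
           (2, [30.0, 2.0]), (2, [35.0, 1.0]), (2, [28.0, 3.0])]
model = train_gaussian_nb(samples)
print(predict(model, [6.0, 21.0]))
print(predict(model, [32.0, 2.0]))
```

Because each class is summarized by a few per-feature statistics, the model can be fitted from little data, which matches the point above about small training sets.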


Analyze: Run model


Code: Run Model


If you want to know more

‣https://developer.ibm.com/clouddataservices/

‣https://github.com/ibm-cds-labs/pipes-connector-flightstats

‣http://spark.apache.org/docs/latest/mllib-guide.html

‣https://console.ng.bluemix.net/data/analytics/


Thank you

Data & Analytics
