Machine Learning With Apache Spark
CodeMash, Sandusky, Ohio, Jan 5-8, 2016
David Taieb
STSM, IBM Cloud Data Services
©2015 IBM Corporation
Introduction
David Taieb
Developer Advocate, IBM Cloud Data Services
Our mission: We are here to help developers realize their most ambitious projects.
https://developer.ibm.com/clouddataservices/connect/
Big data, cloud and the rise of business analytics
‣ Data collected by enterprises is growing exponentially: ERP, embedded systems (IoT)
‣ The cloud, with high availability and huge capacity, makes more data available for analytics
‣ Big data and cloud create new opportunities:
- Organizations: more effective decision-making process, richer client interactions
- Business users: discover new insights, better decision-making process
- Developers: access to diverse data sources and new tools that increase productivity
Why Business Analytics with big data
“In God we trust. All others bring data”
W. Edwards Deming
‣Every day, companies make bet-the-business decisions about their customers, competitors and new products
‣Time available for decision-making is shrinking (sometimes real-time)
‣As more and more companies go digital, data becomes the world’s newest resource for competitive advantage
‣Decision making has moved from the elite few to the empowered many
‣Few organizations can keep pace with the appetite for data
Business Analytics Types
Descriptive Analytics: Look at the reason for past success or failure
Predictive Analytics: What is probably going to happen in the future?
Prescriptive Analytics: What are my best actions?
• Use interactive querying and visualization to explore and communicate data
• Discover insight and trends
• Correlation between 2 seemingly unrelated variables
• Data mining
• Generate hypotheses and models
• Predict occurrence of future events using probability (confidence)
• Product recommendations
• Classification
• Help make the right decision based on the data
• Find optimal solution to a given problem
Taking Analytics a step further with Cognitive Systems
‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data
Decide
• Ingest and analyze domain sources, info models
• Generate evidence-based decisions with confidence
• Learn with new outcomes and actions
• e.g. - Next generation Apps Probabilistic Apps
Ask
• Leverage vast amounts of data
• Ask questions for greater insights
• Natural language inquiries
• e.g. - Next generation Chat
Discover
• Find the rationale for given answers
• Prompt for inputs to yield improved responses
• Inspire considerations of new ideas
• e.g. - Next generation Search Discovery
IBM Watson
IBM Cloud Data Services
Resources for developers to get, build, and analyze on the IBM Cloud
What is Spark
Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
Spark Core Libraries
Spark Core: general compute engine, handles distributed task dispatching, scheduling and basic I/O functions
Spark SQL: executes SQL statements
Spark Streaming: performs streaming analytics using micro-batches
MLlib (machine learning): common machine learning and statistical algorithms
GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
Open Source
• Largest project and one of the most active on Apache
• Vibrant, growing community of developers continuously improves the code base and extends capabilities
• Fast adoption in the enterprise (IBM, Databricks, etc.)
Fast
• In-memory storage greatly reduces disk I/O
• Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk (as claimed by the Spark project)
Distributed data processing
• Fault tolerant: seamlessly recomputes data lost to hardware failure
• Scalable: easily increase the number of worker nodes
• Flexible job execution: batch, streaming, interactive
Productive
• Unified programming model across a range of use cases
• Rich and expressive APIs hide the complexities of parallel computing and worker node management
• Support for Java, Scala, Python and R: less code written
• Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
Web Scale
• Easily handles petabytes of data without special code handling
• Compatible with the existing Hadoop ecosystem
IBM is all-in on its commitment to Spark
Foster Community
• Educate 1M+ data scientists and engineers via online courses
• Sponsor AMPLab, creators and evangelists of Spark
Infuse the Portfolio
• Integrate Spark throughout portfolio
• 3,500 employees working on Spark-related topics
• Spark however customers want it: standalone, platform or products
Contribute to the Core
• Launch Spark Technology Center (STC), 300 engineers
• Open source SystemML
• Partner with Databricks
Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
Spark MLlib
‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms
‣ Highly scalable: leverages Spark's ability to work with massive amounts of data
‣ Fast: designed for parallel computing
‣ Covers common machine learning algorithms:
- Regression
- Classification
- Clustering
- Recommender Systems
- Text Analytics
What is Machine Learning and where is it used
‣ Subfield of computer science that focuses on getting computers to learn from data:
- Recognize patterns
- Make predictions
‣ Example use:
- Spam filters
- Netflix recommendations
- Self-driving cars
- Watson
- …
Typical Machine Learning Flow diagram
Data Acquisition
Data Preparation: Cleansing, Shaping, Enrichment
Data Annotation (Ground Truth)
Model Training
Model Testing
Data sets: Training Set, Test Set, Blind Set
Iterative: Train Model, Cross-Validation, Evaluate Performance and optimize model
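The training, test and blind sets named in the flow can be produced with a simple shuffled split. The sketch below is a minimal, pure-Python illustration. The function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions chosen for the example, not values from the talk.

```python
import random

def split_dataset(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, test and blind sets.

    The fractions and the seed are illustrative assumptions. Whatever is
    left after the training and test slices becomes the blind set.
    """
    if not 0 < train_frac + test_frac < 1:
        raise ValueError("train_frac + test_frac must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    blind = shuffled[n_train + n_test:]
    return training, test, blind

training, test, blind = split_dataset(range(100))
print(len(training), len(test), len(blind))
```

The blind set is held back until the very end, so that the final performance numbers are measured on data that played no part in training or tuning.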
MLlib Algorithm Overview
• Predictive analytics
• Recommendations
- Collaborative Filtering
- Matrix Factorization
• Feature extraction and Transformation
- TF-IDF
- HashingTF
- Word2Vec
- StandardScaler
- Normalizer
• Model Evaluation/Metrics
- Binary Classification Metrics
- Multi Class Metrics
- Regression Metrics
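To make two of the listed transformations concrete, here is a minimal pure-Python sketch of what standard scaling and vector normalization compute. It illustrates the underlying arithmetic only and is not the MLlib API. The function names and the use of the sample standard deviation are assumptions of this example.

```python
import math

def standard_scale(column):
    """Rescale one feature column to zero mean and unit standard deviation."""
    n = len(column)
    if n < 2:
        raise ValueError("need at least two values to estimate a deviation")
    mean = sum(column) / n
    # Sample standard deviation (divides by n - 1): an assumption of this sketch.
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)
    std = math.sqrt(variance)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

def l2_normalize(vector):
    """Rescale one feature vector (one row) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

print(standard_scale([1.0, 2.0, 3.0]))
print(l2_normalize([3.0, 4.0]))
```

Scaling works column by column across the whole data set, while normalization works row by row on a single feature vector.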
Predictive analytics
Supervised Learning (requires Ground-Truth)
Continuous Output:
• Regression
- Linear
- Ridge
- Lasso
- Isotonic
• Decision Tree
• RandomForest
• GradientBoostedTree
Discrete Output:
• Classification
- Logistic Regression
- SVM
- NaiveBayes
• Decision Tree
• RandomForest
• GradientBoostedTree
• K-NN (available as add-on Spark package)
Unsupervised Learning (no Ground-Truth data required)
• Clustering
- KMeans
- Gaussian Mixture
• Dimensionality Reduction
- PCA
- SVD
• FP-Growth
Featured demo: Flight Delay Predictor
‣ Use training data collected from flightstats and enriched with weather observations from the “Insight for Weather” service on Bluemix
‣ Train a multi-class classifier that, given a flight and departure weather observations, can predict the flight delay class:
- 0 = Canceled
- 1 = On Time
- 2 = Delay less than 2 hours
- 3 = Delay between 2 and 4 hours
- 4 = Delay more than 4 hours
‣ Provide metrics for each algorithm:
- Accuracy
- Precision
- Recall
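The delay classes above can be derived from a raw delay value with a small labeling function. This is a minimal sketch: the function name `delay_class`, the use of minutes as the unit, and the handling of the exact 2-hour and 4-hour boundaries are assumptions, since the slide does not specify them.

```python
def delay_class(delay_minutes, canceled=False):
    """Map a flight's delay in minutes to one of the five classes.

    Boundary handling is an assumption of this sketch: a delay of exactly
    2 hours or exactly 4 hours is placed in class 3.
    """
    if canceled:
        return 0  # Canceled
    if delay_minutes <= 0:
        return 1  # On time (early arrivals count as on time)
    if delay_minutes < 120:
        return 2  # Delay less than 2 hours
    if delay_minutes <= 240:
        return 3  # Delay between 2 and 4 hours
    return 4      # Delay more than 4 hours

for minutes in (0, 45, 120, 241):
    print(minutes, delay_class(minutes))
```

Labels produced this way serve as the ground truth that the classifiers are trained and evaluated against.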
Architecture
[Diagram: flightstats data (Airports, Flight Schedules, Flight Status) and Weather data flow through Simple Data Pipes, a custom connector run every 24 hours, into Cloudant (Metadata, Training Set, Test Set, Blind Set), which is read by the Notebook]
Get
‣ Identify data sources:
- flightstats.com: https://developer.flightstats.com
- Airport metadata: FS Code, geolocation, …
- Flight Schedules
- Flight Status
- Weather Observations
- Insight for Weather on Bluemix
‣ Storage:
- Cloudant
‣ Tool used:
- Simple Data Pipes custom connector to build Training, Test and Blind data sets
‣ Constraints:
- Weather service provides past observations only as far back as 24 hours
- Flightstats API key is a 30-day trial version, limited to 20,000 calls
Custom Pipes Connector to build training data set
https://developer.ibm.com/clouddataservices/simple-data-pipe/
Run every 24 hours
Because the Weather service doesn’t return observations older than 24 hours, the data set build must be run every 24 hours
Build: Explore the data with Notebook
Loading training data set
Build: Visualize and explore data set
Scatter plot of flight delays based on temperature at the departure and arrival airports
Build: Visualize and explore data set
Scatter plot of flight delays based on wind speed at the departure and arrival airports
Constraints
‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data:
- Limit the number of features used to train the models to the intersection of the two.
‣ Restrict the training data to the weather forecast at the departure and arrival airports
- Would adding weather data from various points along the route increase the model performance?
‣ Difficult to get enough representative data because I was using a trial account on flightstats
- Ideally, I would use more airports with more representative weather
‣ Didn’t use any categorical features
‣ For simplicity: use IPython Notebook as the user interface
- Makes the experience less compelling for business users
- To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library
- Python doesn’t cover as much of the Spark API as Scala
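The first constraint, training only on features that exist in both the past observations and the forecast data, comes down to a set intersection. The sketch below shows one way to do it. All field names are hypothetical placeholders and are not taken from the Insight for Weather API.

```python
# Hypothetical field names, for illustration only.
OBSERVATION_FIELDS = ["temp", "wspd", "vis", "pressure", "dewpt", "rh"]
FORECAST_FIELDS = ["temp", "wspd", "vis", "rh", "pop"]

def shared_features(observation_fields, forecast_fields):
    """Keep only the fields present in both sources, in observation order."""
    forecast = set(forecast_fields)
    return [name for name in observation_fields if name in forecast]

def to_feature_vector(record, feature_names):
    """Build a numeric feature vector from a record, using only the shared fields."""
    return [float(record[name]) for name in feature_names]

features = shared_features(OBSERVATION_FIELDS, FORECAST_FIELDS)
observation = {"temp": 21, "wspd": 12, "vis": 10, "pressure": 1013, "dewpt": 9, "rh": 55}
print(features)
print(to_feature_vector(observation, features))
```

Using the same ordered feature list for training (observations) and prediction (forecasts) keeps the two vectors aligned position by position.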
Load labeled data RDD
Build: NaiveBayes Classification
Build: Decision Tree classification
Build: Random Forest classification
Build: Performance measurements
Load blind data
Build: Compare metrics between different models
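As a rough illustration of how accuracy, precision and recall can be compared across models, here is a minimal pure-Python sketch. The labels and predictions are made-up sample values, and the helper names are assumptions. The notebook's actual code is not reproduced here.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def precision_recall(labels, predictions, cls):
    """Precision and recall for a single class (one-vs-rest)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == cls and p == cls)
    fp = sum(1 for y, p in pairs if y != cls and p == cls)
    fn = sum(1 for y, p in pairs if y == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def compare_models(labels, predictions_by_model, classes):
    """Compute accuracy and per-class precision/recall for every model."""
    report = {}
    for name, predictions in predictions_by_model.items():
        report[name] = {
            "accuracy": accuracy(labels, predictions),
            "per_class": {c: precision_recall(labels, predictions, c) for c in classes},
        }
    return report

# Made-up blind-set labels and model outputs, for illustration only.
blind_labels = [1, 1, 2, 2, 3, 1]
predictions_by_model = {
    "NaiveBayes": [1, 2, 2, 2, 3, 1],
    "DecisionTree": [1, 1, 2, 1, 3, 1],
}
report = compare_models(blind_labels, predictions_by_model, classes=[1, 2, 3])
for name, metrics in report.items():
    print(name, metrics)
```

Two models can share the same accuracy yet differ per class, which is why precision and recall are reported alongside it.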
Naïve Bayes vs Decision Tree
Naïve Bayes
‣ Probabilistic: compute the probability of a data instance being in a specific class
‣ Assumes that each feature (variable) is independent of the others
‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
‣ Works well with a small amount of training data. Doesn’t need all the possibilities
‣ Doesn’t work with categorical features.
Decision Tree
‣ Non-probabilistic: partition the data into subsets that best describe the variable
‣ The deeper the tree, the better the model fits the data
‣ Watch out for overfitting: need to prune the tree
‣ Can handle categorical or continuous features
‣ No need for input to be scaled or standardized: set your features and go!
‣ Requires a lot of data covering all possibilities
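The two Naïve Bayes properties listed first, a per-class probability and the feature independence assumption, can be seen in a few lines of code. The sketch below is a pure-Python Gaussian variant chosen for readability with continuous weather-like features. It is not the MLlib implementation (MLlib's NaiveBayes offers multinomial and Bernoulli models), and the sample data is made up.

```python
import math
from collections import defaultdict

def train_gaussian_nb(features, labels):
    """Estimate a prior plus per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(features, labels):
        by_class[label].append(row)
    total = len(labels)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / total, means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        # Independence assumption: per-feature log-likelihoods are simply added.
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)

# Made-up (temperature, wind speed) rows: class 1 = on time, class 2 = delayed.
rows = [[20.0, 5.0], [22.0, 7.0], [18.0, 6.0], [-5.0, 30.0], [-3.0, 35.0], [-8.0, 28.0]]
classes = [1, 1, 1, 2, 2, 2]
model = train_gaussian_nb(rows, classes)
print(predict(model, [19.0, 6.0]), predict(model, [-4.0, 31.0]))
```

Because each feature contributes its own term to the score, a feature with no predictive value still shifts the result, which matches the accuracy caveat in the list above.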
Analyze: Run model
Code: Run Model
If you want to know more
‣https://developer.ibm.com/clouddataservices/
‣https://github.com/ibm-cds-labs/pipes-connector-flightstats
‣http://spark.apache.org/docs/latest/mllib-guide.html
‣https://console.ng.bluemix.net/data/analytics/
Thank you