Power of data. Simplicity of design. Speed of innovation.

© 2016 IBM Corporation

Carlo Appugliese

Big Data Evangelist


www.linkedin.com/in/carloappugliese


Pijush Chatterjee

Big Data Evangelist


https://www.linkedin.com/in/pijushchatterjee


The digital age is changing the way we Live, Play, Learn and Work…


Big Data and analytics sample architecture

[Figure: sample architecture diagram. Recoverable labels follow.]

All Data Sources
• Streaming Data
• Text Data
• Applications Data
• Time Series
• Geo Spatial
• Relational
• Social Network
• Video & Image

Big Data Platform Capabilities
• Information Ingest
• Real Time Analytics
• Warehouse & Data Marts
• Analytic Appliances

Zones
• Ingestion and Operational Information
• Landing and Archive Zone
• Real-time Analytics Zone
• Enterprise Warehouse and Mart Zone
• Analytic Appliances
• Information Governance, Security and Business Continuity

Advanced Analytics / New Insights
• Cognitive: Learn Dynamically?
• Prescriptive: Best Outcomes?
• Predictive: What Could Happen?
• Descriptive: What Has Happened?
• Exploration and Discovery: What Do You Have?

New / Enhanced Applications
• Automated Process
• Case Management
• Analytic Applications
• Watson
• Cloud Services
• ISV Solutions
• Alerts

Apache Spark

Apache Spark is enabling competitive advantage


What is Spark?

Spark is an open source, in-memory application framework for distributed data processing and iterative analysis on massive data volumes

“Analytic Operating System”


Apache Spark has no limits!!

[Figure: Spark architecture. Labels: Spark Application (Driver Program) with SparkContext; Local Threads; Cluster Manager; two Worker Nodes, each with a Spark Executor and Cache; Swift (SoftLayer), AWS S3, HDFS, or other storage.]

• Spark programs generally consist of two components: a driver program and worker program(s)
– The driver program manages the division of computations into tasks that are sent to worker nodes
– Worker programs run smaller portions of the computations
• The SparkContext object instructs Spark on how and where to access a cluster
• The Cluster Manager manages the physical resources needed to run the driver and worker programs
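The driver/worker split described above can be illustrated with a toy sketch in plain Python. This is not Spark's API: the names `split_into_tasks`, `worker` and `driver` are hypothetical, and a local thread pool stands in for the worker nodes (loosely mirroring the "Local Threads" label in the figure).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(data, num_tasks):
    """Driver side: divide the data into roughly equal partitions (tasks)."""
    size = max(1, -(-len(data) // num_tasks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(partition):
    """Worker side: run a small portion of the overall computation."""
    return sum(x * x for x in partition)


def driver(data, num_workers=4):
    """Driver: split the job, send tasks to workers, combine partial results."""
    tasks = split_into_tasks(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker, tasks))
    return sum(partial_results)


# Sum of squares of 1..100, computed as four independent tasks.
total = driver(list(range(1, 101)))
print(total)
```

In a real deployment the cluster manager, not the program itself, would decide where the workers run and what resources they receive.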


Apache Spark is Fast!

Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O

Solution: Keep data in memory with a new distributed execution engine

[Figure: two pipelines compared. Recoverable labels:
• Chained jobs (chain job output into new job input): HDFS Read, Iteration 1 (CPU, Memory), HDFS Write, HDFS Read, Iteration 2 (CPU, Memory), HDFS Write
• In-memory: HDFS Read, Input, Iteration 1 (CPU, Memory), Iteration 2 (CPU, Memory); 10–100x faster than network & disk; minimal read/write disk bottleneck]
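The contrast between chaining jobs through storage and iterating in memory can be sketched in plain Python. This is a toy model under stated assumptions, not a benchmark and not Spark or MapReduce code: a temporary directory stands in for HDFS, and `step`, `run_chained_on_disk` and `run_in_memory` are hypothetical names. It only counts how often each approach touches storage.

```python
import json
import os
import tempfile


def step(values):
    """One iteration of a toy iterative job: move each value halfway to 10."""
    return [v + (10 - v) / 2 for v in values]


def run_chained_on_disk(values, iterations, workdir):
    """Chained jobs: every iteration reads its input from disk and writes its output back."""
    storage_ops = 0
    path = os.path.join(workdir, "iter_0.json")
    with open(path, "w") as f:          # initial input already "in storage"
        json.dump(values, f)
    for i in range(iterations):
        with open(path) as f:           # read the previous job's output
            current = json.load(f)
        storage_ops += 1
        current = step(current)
        path = os.path.join(workdir, f"iter_{i + 1}.json")
        with open(path, "w") as f:      # write this job's output
            json.dump(current, f)
        storage_ops += 1
    return current, storage_ops


def run_in_memory(values, iterations):
    """In-memory: read the input once, then keep intermediate results in memory."""
    current = list(values)              # the single initial read
    for _ in range(iterations):
        current = step(current)
    return current, 1


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_chained_on_disk([0.0, 4.0], 3, workdir)
mem_result, mem_ops = run_in_memory([0.0, 4.0], 3)
print(disk_result == mem_result, disk_ops, mem_ops)
```

Both approaches produce the same values; the chained version simply pays for a read and a write on every iteration, which is the bottleneck the slide points at.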


What Spark isn’t

A data store – Spark attaches to many data stores but does not provide its own

Only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a separate, standalone system

Only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well

Only for batch processing – Spark has streaming and machine learning capabilities, and is increasingly being used for end-user real-time analytic applications


Spark includes a set of core libraries that enable various analytic methods which can process data from many sources

• Spark Core Engine: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
• Spark SQL: executes SQL statements
• Spark Streaming: performs streaming analytics using micro-batches
• MLlib (machine learning): common machine learning and statistical algorithms
• GraphX (graph): distributed graph processing framework

A large variety of data sources and formats can be supported, both on premises and in the cloud:
• BigInsights (HDFS)
• Cloudant
• dashDB
• Object Storage
• SQL DB
• …many others

Deployment: IBM CLOUD, OTHER CLOUD, CLOUD APPS, ON-PREMISE
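The micro-batch idea mentioned for Spark Streaming can be sketched in plain Python. This is a simplification under stated assumptions, not the Spark Streaming API: batches here are cut by event count, whereas a streaming engine would typically cut them by time interval, and `micro_batches` and `running_counts` are hypothetical names.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator of events into small lists."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(stream, batch_size):
    """Process one small batch at a time, emitting updated totals after each batch."""
    totals = {}
    for batch in micro_batches(stream, batch_size):
        for event in batch:
            totals[event] = totals.get(event, 0) + 1
        yield dict(totals)  # snapshot after this micro-batch


events = ["error", "ok", "ok", "error", "ok"]
snapshots = list(running_counts(events, batch_size=2))
for snapshot in snapshots:
    print(snapshot)
```

Each snapshot reflects all events seen so far, so results become available shortly after each batch rather than only at the end of the stream.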


Spark Programming Languages

Scala
– Functional programming
– Spark is written in Scala
– Scala compiles into Java bytecode

Java
– New features in Java 8 make for more compact coding (lambda expressions)

Python
– Most widely used API with Spark today

R
– Functional programming language used to create and manipulate functions

DataFrames make all languages equally performant

Language   2014      2015
Scala      84%       71%
Java       38%       31%
Python     38%       58%
R          unknown   18%

Survey done by Databricks, Summer 2015

This probably means that more “data scientists” are starting to use Spark


What is a “Notebook”?

Pen and Paper
Pen and paper has long provided the rich experience that scientists need to document progress through notes and drawings:
– Expressive
– Cumulative
– Collaborative

Notebooks
Notebooks are the digital equivalent of the “pen and paper” lab notebook, enabling data scientists to document reproducible analysis:
– Markdown and visualization
– Iterative exploration
– Easy to share


Open Source Web-Based Notebooks

Notebooks: “interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media”

Zeppelin
– Classified by Apache with a status of “incubator”
– Current version: 0.5.5 (Nov 18, 2015)
– Supports multiple interpreters
• Scala, Python, SparkSQL

Jupyter
– Based on IPython
– Current version: 4.1.0b1
– Supports multiple interpreters
• Python, Scala


Key reasons for interest in Spark

Performant
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks

Productive
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in the data lifecycle
– Integrated with common programming languages: Java, Python, Scala
– New tools continually reduce the skill barrier for access (e.g. SQL for analysts)

Leverages existing investments
– Works well within the existing Hadoop ecosystem

Improves with age
– Large and growing community of contributors continuously improves the full analytics stack and extends its capabilities


Spark Common Use Cases

Interactive Query
• Enterprise-scale data volumes accessible to interactive query for business intelligence (BI)
• Faster time to job completion allows analysts to ask the “next” question about their data & business

Large-Scale Batch
• Data cleaning to improve data quality (missing data, entity resolution, unit mismatch, etc.)
• Nightly ETL processing from production systems

Complex Analytics
• Forecasting vs. “Nowcasting” (e.g. Google Search queries analyzed en masse for Google Flu Trends to predict outbreaks)
• Data mining across various types of data

Event Processing
• Web server log file analysis (human-readable file formats that are rarely read by humans) in near-real time
• Responsive monitoring of RFID-tagged devices

Model Building
• Predictive modeling answers questions of "what will happen?"
• Self-tuning machine learning, continually updating algorithms, and predictive modeling


IBM is all-in on Spark

Contribute to the Core
• Launch Spark Technology Cluster (STC), 300 engineers
• Open source SystemML
• Partner with Databricks

Foster Community
• Educate 1M+ data scientists and engineers via online courses
• Sponsor AMPLab, creators and evangelists of Spark

Infuse the Portfolio
• Integrate Spark throughout portfolio
• 3,500 employees working on Spark-related topics
• Spark however customers want it: standalone, platform or products

"It's like Spark just got blessed by the enterprise rabbi."
Ben Horowitz, Andreessen Horowitz


IBM w/ Spark empowers all Data Professionals Who Are Hungry to Put Data to Work

• Business Analysts: easily discover and explore data to improve decisions
• Data Engineers: tame, curate and secure data to make it relevant and accessible
• App Developers: easily plug into data and models to make apps more powerful
• Data Science: streamline algorithm development to deliver insight faster


IBM Announcement - Watson Data Platform Project

ibm.co/dataworks

Community-centric User Experiences
– Fit-for-purpose tooling for each data professional
– Bound together through collaboration and an integrated platform

Cloud-Based Data and Analytics Services
– Cloud-based data and analytics services
– Built on open source technologies
– Supplemented w/ enterprise and cognitive capabilities

Solution Blueprints
– Set of data and analytics offerings integrated together to solve specific use cases

Data professionals: Business Analyst, Data Scientist, App Developer, Data Engineer


Tailored Experiences for Users Collaborating Together

• Data Engineer: architects how data is organized & ensures operability
• Data Scientist: gets deep into the data to draw hidden insights for the business
• Business Analyst: works with data to apply insights to the business strategy
• App Developer: plugs into data and models & writes code to build apps

[Figure: data workflow across INPUT, ANALYSIS and OUTPUT stages. Step labels: Understand problem and domain; Ingest data; Explore and understand data; Transform: clean; Transform: shape; Create and build model; Evaluate; Deliver and deploy model; Communicate results.]

Offerings shown (one is marked “Beta”): IBM Bluemix Data Connect; Data Science Experience; Analytics for Apache Spark, SPSS; Watson Analytics; Bluemix


IBM DataWorks Provides Choice of Collaborative User Experiences, Solution Blueprints, and Individual Services

[Figure: IBM DataWorks overview. Recoverable labels follow.]

Stages: Access & Ingest; Find; Share; Collaborate; Store; Analyze & Build; Deploy

User Experiences

Solution Blueprints
• Self-Service Analytics
• Internet of Things
• Data Lake
• Mobile Applications

Individual Services
• IOT
• Streaming
• Data Preparation
• ETL/ELT
• Hadoop
• NoSQL/SQL
• Object Store
• Descriptive
• Predictive
• Prescriptive
• Dev. environment
• Apps/APIs
• Data Pipelines
• Reports
• Models

Powered by
• Governance
• Data Access
• Data Recognition
• Advanced Analytics


Data Professional Evolution – Data Science!

Programmer
• ETL using SED/AWK
• Scripting, SQL
• Data Formats
• Python, R, Scala

Statistician
• Mathematical Background
• Computational Science

Business Expert
• Business/Industry Expertise
• Domain Knowledge
• Supply Chain
• CRM
• Financials
• Networking

Data science combines skills across areas of expertise

Data scientists today vary in their combinations of these skills


Benefits of Spark for Data Science

Spark is Easy… Allows data scientists to code at scale…
– Supports multiple programming interfaces (Scala, Python, Java and R)
– Fewer lines of code to get answers

Spark is Agile… Allows data scientists to build pipelines, learn and get answers...
– Unified APIs (SQL, DataFrames, Streaming, Machine Learning, etc.)
– Supports notebooks (Jupyter, Zeppelin, etc.)

Spark is Fast… Allows data scientists to iterate quicker...
– In-memory processing that scales in a distributed architecture
– Applications follow a lazy evaluation architecture with optimized execution

[Figure: Spark stack. Spark Core (general compute engine, basic I/O functions, task dispatching, scheduling) plus Spark SQL, Spark Streaming, MLlib (machine learning) and GraphX (graphing).]
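The lazy evaluation mentioned under "Spark is Fast" can be illustrated with a toy class in plain Python. This is a sketch under stated assumptions, not Spark's implementation: `LazyDataset` and its methods are hypothetical, and the only point shown is that transformations are recorded first and executed in a single pass when a result is requested.

```python
class LazyDataset:
    """Toy dataset: map/filter only record a plan; collect() executes it."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, predicate):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._ops + (("filter", predicate),))

    def collect(self):
        # Action: run every recorded step in one pass over the source.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def traced_square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


pipeline = LazyDataset(range(5)).map(traced_square).filter(lambda v: v % 2 == 0)
calls_before_collect = list(calls)  # still empty: nothing has executed
result = pipeline.collect()
print(calls_before_collect, result)
```

Because the whole plan is known before anything runs, an engine has the opportunity to optimize execution, for example by fusing the map and filter into one pass as this sketch does.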


Supporting Data Scientist…

Introducing the Data Science Experience
Currently in Beta

Learn: built-in learning to get started or go the distance with advanced tutorials

Create: the best of open source and IBM value-add to create state-of-the-art data products

Collaborate: community and social features that provide meaningful collaboration

URL: http://datascience.ibm.com


IBM Data Science Experience

ALL YOUR TOOLS IN ONE PLACE

IBM Data Science Experience is an environment that brings together everything that a Data Scientist needs. It includes the most popular Open Source tools and IBM unique value-add functionalities with community and social features, integrated as a first class citizen to make Data Scientists more successful.

datascience.ibm.com


Core Attributes of the Data Science Experience

IBM Data Science Experience

Community
• Find tutorials and datasets
• Connect with Data Scientists
• Ask questions
• Read articles and papers
• Fork and share projects

Open Source
• Code in Scala/Python/R/SQL
• Jupyter and Zeppelin* Notebooks
• RStudio IDE and Shiny apps
• Apache Spark
• Your favorite libraries

IBM Added Value
• Data Shaping/Pipeline UI*
• Auto-data preparation and modeling*
• Advanced Visualizations*
• Model management and deployment*
• Documented Model APIs*
• Spark as a Service

Powered by IBM DataWorks in the Cloud

* DSX product roadmap items


https://youtu.be/1HjzkLRdP5k



Demonstration
