Accelerating Real-Time Analytics with Spark
October 8, 2015

©2015 Talend Inc

Housekeeping

Audio – Streamed via media player, turn volume up

Submit questions for Q&A via Group Chat widget

Download slides and event materials

Hashtag: #stratahadoop


Your Speakers Today

Sean Owen, Director of Data Science, Cloudera, EMEA

Yann Delacourt, Director, Big Data Product Management, Talend


Agenda

•  Apache Spark, its architecture and benefits
•  Spark's architecture, deployment strategies and use cases
•  Spark's impact on data science, analytics and machine learning
•  How to move data scientists' work to IT production
•  Best practices for large Spark deployments
•  Mastering Spark's complexity

© Cloudera, Inc. All rights reserved.

Accelerating Real-Time Analytics with Apache Spark
Sean Owen, Director of Data Science, Cloudera, EMEA


What is Apache Spark?

Spark is a general purpose computational framework with more flexibility than MapReduce
•  Leverages distributed memory
•  Full Directed Graph expressions for data parallel computations
•  Improved developer experience
•  Linear scalability, Data Locality
•  Fault-tolerance


The Spark Ecosystem & Hadoop

[Diagram: Spark (with Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames and SparkR) sits alongside Impala, MR, Search and others, on top of YARN resource management and HDFS/HBase storage]


Apache Spark: Flexible, in-memory data processing for Hadoop

Easy Development
•  Rich APIs for Scala, Java, and Python
•  Interactive shell

Flexible Extensible API
•  APIs for different types of workloads:
   •  Batch
   •  Streaming
   •  Machine Learning
   •  Graph

Fast Batch & Stream Processing
•  In-Memory processing and caching


Easy Development: Use Interactively

•  Interactive exploration of data for data scientists
•  No need to develop “applications”
•  Developers can prototype applications on a live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...

scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count
...
res0: Long = 235886

scala>


Easy Development Expressive API

•  map

•  filter

•  groupBy

•  sort

•  union

•  join

•  leftOuterJoin

•  rightOuterJoin

•  sample

•  take

•  first

•  partitionBy

•  mapWith

•  pipe

•  save

•  …

•  reduce

•  count

•  fold

•  reduceByKey

•  groupByKey

•  cogroup

•  cross

•  zip
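To give a feel for how a few of these operators compose, here is a minimal pure-Python sketch of a map, filter, reduceByKey pipeline on an in-memory list. It is only an analogy for the semantics, not Spark's API: `reduce_by_key` is a hypothetical helper written for this example, and the word list is made up.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, combine):
    """Hypothetical helper: merge all values that share a key using `combine`."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(combine, values) for key, values in groups.items()}


# Made-up input standing in for a distributed dataset of words.
words = ["spark", "hadoop", "spark", "yarn", "hdfs", "spark", "yarn"]

# map: turn every word into a (word, 1) pair
pairs = map(lambda w: (w, 1), words)
# filter: keep only words longer than four characters
long_pairs = filter(lambda kv: len(kv[0]) > 4, pairs)
# reduceByKey: sum the counts per word
counts = reduce_by_key(long_pairs, lambda a, b: a + b)

print(counts)
```

In Spark the same chain would run partition by partition across the cluster; here everything happens in one process.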


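Example: Logistic Regression

from math import exp
import numpy

# readPoint, D and iterations are defined elsewhere; spark is the SparkContext
data = spark.textFile(...).map(readPoint).cache()  # keep parsed points in memory
w = numpy.random.rand(D)  # random initial weights
for i in range(iterations):
    # gradient of the logistic loss, summed over all points
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient
print("Final w: %s" % w)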
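For readers without a cluster at hand, the same iterative pattern can be tried locally. The following is a minimal sketch that swaps the cached RDD for a plain Python list and uses the standard logistic-loss gradient, (sigmoid(margin) - 1) * y * x. The toy points, the step size of 1 and the names `gradient`, `add` and `train` are assumptions made for this illustration.

```python
import math
import random
from functools import reduce


def gradient(w, point):
    """Logistic-loss gradient contribution of one (features, label) point."""
    x, y = point
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
    return [scale * xi for xi in x]


def add(a, b):
    """Element-wise sum of two vectors (plays the role of the reduce step)."""
    return [ai + bi for ai, bi in zip(a, b)]


def train(points, dims, iterations=100, seed=0):
    rng = random.Random(seed)
    w = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        # map each point to its gradient, then reduce by summing
        total = reduce(add, map(lambda p: gradient(w, p), points))
        w = [wi - gi for wi, gi in zip(w, total)]
    return w


# Toy, linearly separable data: (features, label) with labels +1 / -1.
points = [
    ([1.0, 2.0], 1.0),
    ([1.0, 3.0], 1.0),
    ([1.0, -2.0], -1.0),
    ([1.0, -3.0], -1.0),
]
w = train(points, dims=2)
print("Final w: %s" % w)
```

Every iteration re-reads the full dataset, which is why keeping it in memory pays off as the iteration count grows.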


Spark Takes Advantage of Memory

Resilient Distributed Datasets (RDD)
•  Memory caching layer that stores data in a distributed, fault-tolerant cache
•  Can fall back to disk when data-set does not fit in memory
•  Created by parallel transformations on data in stable storage
•  Provides fault-tolerance through concept of lineage
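The lineage idea can be illustrated with a toy model: each dataset remembers its parent and the transformation that produced it, so a lost cache can be rebuilt from stable storage. This is a deliberately simplified, single-process sketch; `ToyDataset` and its methods are hypothetical names for this example and do not reflect how Spark implements RDDs.

```python
class ToyDataset:
    """Toy stand-in for an RDD: records how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # stable storage; only set on the root dataset
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # transformation applied to the parent's rows
        self._cache = None     # in-memory copy, may be lost at any time

    def map(self, f):
        return ToyDataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self._cache = self.collect()
        return self

    def evict(self):
        """Simulate losing the cached copy (for example, a failed node)."""
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        if self._parent is None:
            return list(self._source)
        # Cache miss: recompute from the parent by replaying the lineage.
        return self._fn(self._parent.collect())


squares = ToyDataset(source=range(6)).map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0).cache()

before = even_squares.collect()
even_squares.evict()            # the cached data is gone...
after = even_squares.collect()  # ...but lineage lets it be recomputed
print(before, after)
```

Because only the recipe is stored, no replica of the data itself is needed to recover from the loss.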


Fast Processing Using RAM, Operator Graphs

In-Memory Caching
•  Data Partitions read from RAM instead of disk

Operator Graphs
•  Scheduling Optimizations
•  Fault Tolerance

[Figure: operator graph in which RDDs A through F are linked by map, join, filter, groupBy and take; the legend marks RDDs and cached partitions]


Data Science: Batteries Included

MLlib
•  Existing, mature Spark ML subproject
•  Covers the basics well
   •  Decision trees, SVM, LR
   •  ALS, SVD
   •  K-means
   •  … and more
•  Stand-alone implementations
•  Algorithms Only

ML “Pipelines”
•  Beta “MLlib 2.0”
•  Emulates scikit-learn APIs
•  Pipelines, not just algos
   •  Feature engineering
   •  Transformation
   •  Ensembles
•  Unified architecture
•  Spark 1.4+


Faster Iterative ML Algorithms (Data Fits in Memory)

[Chart: running time (s) versus number of iterations (1, 5, 10, 20, 30) for MapReduce and Spark]

•  MapReduce: 110 s/iteration
•  Spark: first iteration = 80 s; further iterations 1 s due to caching
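The per-iteration figures on this slide translate into total running times as follows. This is a small worked calculation that assumes the per-iteration costs stay constant, which is a simplification; the function names are made up for the example.

```python
# Per-iteration costs quoted on the slide, in seconds.
MAPREDUCE_PER_ITERATION = 110
SPARK_FIRST_ITERATION = 80
SPARK_CACHED_ITERATION = 1


def mapreduce_time(iterations):
    """Every MapReduce iteration pays the full cost again."""
    return MAPREDUCE_PER_ITERATION * iterations


def spark_time(iterations):
    """Spark pays the load cost once, then runs on cached data."""
    if iterations == 0:
        return 0
    return SPARK_FIRST_ITERATION + SPARK_CACHED_ITERATION * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print("%2d iterations: MapReduce %4d s, Spark %3d s"
          % (n, mapreduce_time(n), spark_time(n)))
```

At 30 iterations this gives 3300 s against 109 s, roughly a 30x difference, and the gap keeps widening with more iterations.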


Cloudera Customer Use Cases

Core Spark

Financial Services
•  Portfolio Risk Analysis
•  ETL Pipeline Speed-Up
•  20+ years of stock data

Health
•  Identify disease-causing genes in the full human genome
•  Calculate Jaccard scores on health care data sets

ERP
•  Optical Character Recognition and Bill Classification

Data Services
•  Trend analysis
•  Document classification (LDA)
•  Fraud analytics

Spark Streaming

Financial Services
•  Online Fraud Detection

Health
•  Incident Prediction for Sepsis

Retail
•  Online Recommendation Systems
•  Real-Time Inventory Management

Ad Tech
•  Real-Time Ad Performance Analysis


Uniting Spark and Hadoop: The One Platform Initiative

Investment Areas

•  Management: Leverage Hadoop-native resource management.
•  Security: Full support for Hadoop security and beyond.
•  Scale: Enable 10k-node clusters.
•  Streaming: Support for 80% of common stream processing workloads.


Detailed Roadmap: One Platform Initiative

[Legend: markers distinguish completed work from planned future work]

Management
•  Spark on YARN Integration
•  HBase integration
•  Improved metrics for monitoring/troubleshooting
•  Dynamic Resource Allocation
•  Spark on YARN:
   •  Container resizing
   •  Dynamic Resource Allocation for Streaming
•  Simplified resource configuration
•  Improved WebUI for debugging
•  Improved metrics for visibility into resource utilization
•  Smart auto-tuning of job parameters

Security
•  Kerberos Integration
•  HDFS Sync (Sentry)
•  Secure data at rest
•  Secure data over the wire
•  Audit/Lineage (Navigator)
•  Spark PCI compliance
•  Integration with Intel’s advanced encryption libraries
•  Enable column and view level security

Scale
•  Revamp Scheduler handling of node failure
•  Sort based shuffle improvements
•  Task Scheduling based on HDFS data locality and caching
•  Scheduler improvements for performance at scale
•  Stress test at scale with mixed multi-tenant workloads
•  HDFS DDM Integration
•  Dynamic resource utilization & prioritization
•  Scale Spark History Server for 1000s of jobs

Streaming
•  Zero Data Loss with Spark Streaming Resilience
•  Flume integration
•  Kafka integration
•  SQL semantics for expressing streaming jobs (Business Users)
•  New streaming specific API extensions
•  Streaming application management (pause, update, redeploy) via CM
•  Optimized state updates: efficient point lookups and delta updates


The Bad News

Spark is a Developer Framework
•  Spark means writing code
•  And deploying it
•  And monitoring it
•  Workflow orchestration is hard
•  Oozie? Luigi?
•  Custom scripts

Data is Still Fickle
•  Data Quality is still hard
•  Spark still can’t automatically find and clean bad records
•  Feature engineering = ETL
•  Data Integration is still hard
•  Read / write the right formats
•  “Publish” to BI tools


Accelerating Real-Time Analytics with Spark
Yann Delacourt, Director of Big Data Product Management, Talend


A Modern Data Platform for All Your Integration Needs

•  APPLICATION INTEGRATION
•  CLOUD INTEGRATION
•  DATA INTEGRATION
•  BIG DATA INTEGRATION
•  MASTER DATA MANAGEMENT

INTEGRATE ANYTHING. OPERATE IN REAL-TIME. ACT WITH INSIGHT.


BIG DATA, CUSTOMERS & SUPPLIERS

ON-PREMISE APPS

CLOUD APPS | IOT SENSORS | CUSTOMERS | SUPPLIERS

DEVELOPER STUDIO Web UI

DATA FABRIC

1st Data Integration Platform on Apache Spark

Introducing Talend Real-time Big Data: 1st Data Integration Platform on Spark

•  Visually develop jobs that run 100% on Spark
•  5X faster in independent benchmarks
•  10X developer productivity gained over hand-coding Spark
•  100X faster with in-memory processing
•  Over 100 new drag-n-drop Spark components
   •  HDFS, RDBMS, NoSQL, Cloud Storage, Transformation, Messaging, In-memory analytics & machine learning recommendations, and much more
•  In-memory data caching & “windowed” computations
•  Click to enable Spark Streaming for real-time data processing
•  Convert Talend MapReduce jobs to Spark with the click of a button, future proofing your investment

Benefits: Make decisions faster. Tremendous developer productivity.

Enabling Intelligent Data Pipelining

Lambda Architecture: Batch, Real-time, Query

•  A single solution to address
   •  Bulk/batch
   •  Real-time
   •  Streaming & IoT data
   •  Machine Learning
•  Provides Fast Data access through NoSQL
•  One tool for Hadoop, Spark, traditional ETL/ELT and NoSQL integration

Benefits: Developer productivity. Business agility.

[Figure: Lambda architecture. Sources (IoT, web logs, ERP, DBMS/EDW, legacy) feed a batch layer (all data, learning on past data, pre-computed views) and a speed layer (incremental data, sliding window analytics, apply learning, real-time views); a serving layer with NoSQL answers queries.]
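The query path of a Lambda architecture can be sketched in a few lines: a batch layer pre-computes a view over all historical data, a speed layer maintains a view over the newest events, and the serving layer merges both at query time. The example below is a minimal single-process illustration, not Talend's or Spark's implementation; the page-view events and the function names are invented for this sketch.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: pre-compute page-view counts over the full history."""
    return Counter(event["page"] for event in all_events)


def realtime_view(incremental_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["page"] for event in incremental_events)


def query(page, batch, speed):
    """Serving layer: merge the pre-computed and real-time views."""
    return batch[page] + speed[page]


# Invented sample data: history already processed by the batch layer...
history = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
# ...and fresh events that have only reached the speed layer so far.
recent = [{"page": "/home"}, {"page": "/checkout"}]

batch = batch_view(history)
speed = realtime_view(recent)

print(query("/home", batch, speed))
print(query("/checkout", batch, speed))
```

Once the next batch run absorbs the recent events, the real-time view can be reset, which keeps the speed layer small.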


Easily Convert MapReduce to Spark!

Your Job Now 5X Faster

MapReduce (runs on disk)

Spark (runs on disk and in-memory)

One Click


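Spark/Talend Enabled Use Cases: Examples

Digital Economy
•  Data Discovery (Interactive): Web Analytics
•  Better Decisions (Batch): Click-Stream Analysis
•  Real-Time Action (Streaming and Machine Learning): Real-Time Web Traffic Optimization (retargeting & recommendations)

Retail
•  Data Discovery (Interactive): SCM Analytics
•  Better Decisions (Batch): Find Purchase Correlation
•  Real-Time Action (Streaming and Machine Learning): Real-Time Promotion & Coupon Optimization

Financial Services
•  Data Discovery (Interactive): EDW
•  Better Decisions (Batch): Fraud Detection Learning on Massive Data Volume
•  Real-Time Action (Streaming and Machine Learning): High-Scalable Trading, Risk Management & Real-Time Fraud Detection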


Talend Success

Challenge:
•  Ever increasing Big Data velocity
•  Many last minute cart abandonments
•  Hard to optimize pricing

Why Talend:
•  Is the central integration tool within their Business Intelligence (BI) organization.
•  Integrates clickstreams from the last 6 months

Value:
•  Leftover merchandise reduced by 20%
•  Can predict abandoned shopping carts in real-time with 90% accuracy
•  Optimize Pricing and Stock pricing


Industrial Internet

Challenge:
•  Needed to migrate 800 ETL jobs to an “Industrial Internet”
•  Improve service levels by providing data and analytics in the cloud

Solution:
•  Integrate big data, small data, and transactional data with high quality.
•  Talend Big Data, Data Quality, Master Data Management

Value:
•  Provide a collaborative, prescriptive, and predictive environment
•  Improved customer satisfaction, improved productivity per turbine
•  Predict failures & Reduce inventory
•  Arm sales with competitive intelligence


From Zero to Big Data in 10 Minutes
Download free: www.talend.com/download

•  Get up and running in minutes, not weeks, with a big data Sandbox and demos

•  Includes: Sentiment analysis, ETL Offload, Log file analysis, Recommendation engine

•  Start working with Talend, Hadoop & NoSQL today!



The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco


Q&A
