Accelerating Real-Time Analytics with Spark
October 8, 2015

©2015 Talend Inc

Housekeeping

Audio – Streamed via media player, turn volume up

Submit questions for Q&A via Group Chat widget

Download slides and event materials

Hashtag: #stratahadoop

Your Speakers Today

Sean Owen, Director of Data Science, Cloudera, EMEA

Yann Delacourt, Director, Big Data Product Management, Talend

Agenda

• Apache Spark, its architecture and benefits
• Spark's architecture, deployment strategies and use cases
• Spark's impact on data science, analytics and machine learning
• How to move data scientists' work to IT production
• Best practices for large Spark deployments
• Mastering Spark's complexity

Accelerating Real-Time Analytics with Apache Spark
Sean Owen, Director of Data Science, Cloudera, EMEA

© Cloudera, Inc. All rights reserved.

What is Apache Spark?

Spark is a general-purpose computational framework with more flexibility than MapReduce.

• Leverages distributed memory
• Full directed graph expressions for data-parallel computations
• Improved developer experience
• Linear scalability, data locality
• Fault tolerance

The Spark Ecosystem & Hadoop

[Diagram: Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames and SparkR sit on top of Spark. Spark runs alongside Impala, MR, Search and others on YARN (resource management), over HDFS and HBase (storage).]

Apache Spark
Flexible, in-memory data processing for Hadoop

Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell

Flexible, Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph

Fast Batch & Stream Processing
• In-memory processing and caching

Easy Development
Use Interactively

• Interactive exploration of data for data scientists
• No need to develop “applications”
• Developers can prototype an application on a live system

    percolateur:spark srowen$ ./bin/spark-shell --master local[*]
    ...
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ...

    scala> val words = sc.textFile("file:/usr/share/dict/words")
    ...
    words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

    scala> words.count
    ...
    res0: Long = 235886

    scala>

Easy Development
Expressive API

•  map

•  filter

•  groupBy

•  sort

•  union

•  join

•  leftOuterJoin

•  rightOuterJoin

•  sample

•  take

•  first

•  partitionBy

•  mapWith

•  pipe

•  save

•  …

•  reduce

•  count

•  fold

•  reduceByKey

•  groupByKey

•  cogroup

•  cross

•  zip
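To give a feel for what a few of these operators compute, here is a minimal single-machine sketch in plain Python. It is not Spark code and does not use the Spark API: `reduce_by_key` is a hypothetical helper written for this illustration, and the sample `lines` are made up. It only mirrors the semantics of flatMap, filter, map and reduceByKey on a small list.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values that share a key with func (what reduceByKey does per key)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up input standing in for the lines of a text file.
lines = ["spark is fast", "spark is flexible", ""]

words = [w for line in lines for w in line.split()]   # flatMap: lines -> words
long_words = filter(lambda w: len(w) > 2, words)      # filter: drop short words
pairs = map(lambda w: (w, 1), long_words)             # map: word -> (word, 1)
counts = reduce_by_key(pairs, lambda a, b: a + b)     # reduceByKey: sum per word

print(counts)
```

In Spark the same chain would run in parallel across partitions; the sketch only shows the data flow from one operator to the next.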

Example: Logistic Regression

    import numpy
    from math import exp

    # readPoint, D and iterations are defined elsewhere; the input path is elided on the slide
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        # gradient of the logistic loss, summed over all cached points
        gradient = (data
            .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x)
            .reduce(lambda x, y: x + y))
        w -= gradient
    print("Final w: %s" % w)
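For reference, the update in the logistic regression example follows from the gradient of the logistic loss. The derivation below is a sketch under the usual assumption (not stated on the slide) that labels are encoded as y in {-1, +1} and that the step size is 1:

```latex
\[
L(w) = \sum_{i} \log\!\left(1 + e^{-y_i \, w \cdot x_i}\right)
\]
\[
\nabla_w L(w)
  = \sum_{i} \frac{-e^{-y_i \, w \cdot x_i}}{1 + e^{-y_i \, w \cdot x_i}} \; y_i \, x_i
  = \sum_{i} \left( \frac{1}{1 + e^{-y_i \, w \cdot x_i}} - 1 \right) y_i \, x_i
\]
\[
w \leftarrow w - \nabla_w L(w)
\]
```

The map step computes one term of the sum per data point and the reduce step adds the terms together, which is why the cached data set is scanned once per iteration.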

Spark Takes Advantage of Memory

Resilient Distributed Datasets (RDD)
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Can fall back to disk when the data set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault tolerance through the concept of lineage
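The idea of lineage can be illustrated with a toy, single-process sketch in plain Python. This is not how Spark is implemented and none of the names below come from the Spark API: `LineageDataset`, `evict` and `load` are hypothetical, chosen for this illustration. Each dataset remembers its parent and the transformation that produced it, so a lost cached copy can be rebuilt from stable storage instead of being replicated.

```python
class LineageDataset:
    """Toy stand-in for an RDD: remembers how to rebuild itself from its parent."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # callable that reads base data from "stable storage"
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's records
        self._cache = None          # optional in-memory copy, which may be lost

    def map(self, func):
        return LineageDataset(parent=self, transform=lambda rows: [func(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self._cache = self.compute()
        return self

    def evict(self):
        """Simulate losing the cached copy, for example after a node failure."""
        self._cache = None

    def compute(self):
        if self._cache is not None:
            return self._cache                         # served from memory
        if self.parent is None:
            return list(self.source())                 # re-read stable storage
        return self.transform(self.parent.compute())   # replay the lineage

    def collect(self):
        return self.compute()


reads = []  # records every time stable storage is read


def load():
    reads.append(1)
    return [1, 2, 3, 4, 5, 6]


base = LineageDataset(source=load)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = evens_squared.collect()    # answered from the cache
evens_squared.evict()              # cached copy is lost
second = evens_squared.collect()   # rebuilt by replaying filter and map
```

After the eviction the result is identical, at the cost of one more read of the source, which is the trade-off lineage makes in place of keeping replicas.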

 

Fast Processing Using RAM, Operator Graphs

In-Memory Caching
• Data partitions read from RAM instead of disk

Operator Graphs
• Scheduling optimizations
• Fault tolerance

[Diagram: operator graph of RDDs A through F connected by map, groupBy, filter, join and take operations. The legend distinguishes RDDs from cached partitions.]

Data Science
Batteries Included

MLlib
• Existing, mature Spark ML subproject
• Covers the basics well
  • Decision trees, SVM, LR
  • ALS, SVD
  • K-means
  • … and more
• Stand-alone implementations
• Algorithms only

ML “Pipelines”
• Beta “MLlib 2.0”
• Emulates scikit-learn APIs
• Pipelines, not just algos
  • Feature engineering
  • Transformation
  • Ensembles
• Unified architecture
• Spark 1.4+

Faster Iterative ML Algorithms (Data Fits in Memory)

[Chart: running time (s), 0 to 4000, versus number of iterations (1, 5, 10, 20, 30), for MapReduce and Spark.]

• MapReduce: 110 s/iteration
• Spark: first iteration = 80 s; further iterations 1 s due to caching
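The per-iteration figures on this chart can be turned into total running times with a little arithmetic. The sketch below assumes a simple linear model built only from those annotations (110 s for every MapReduce iteration; 80 s for the first Spark iteration and 1 s for each later one). The function names are hypothetical, and the model ignores any variation the real measurements showed.

```python
def mapreduce_seconds(iterations, per_iteration=110):
    """Total time when every iteration re-reads the data from disk."""
    return per_iteration * iterations


def spark_seconds(iterations, first=80, cached=1):
    """Total time when the first iteration loads the data and later ones hit the cache."""
    if iterations == 0:
        return 0
    return first + cached * (iterations - 1)


totals = {n: (mapreduce_seconds(n), spark_seconds(n)) for n in (1, 5, 10, 20, 30)}

for n, (mr, sp) in totals.items():
    print(f"{n:>2} iterations: MapReduce {mr:>4} s, Spark {sp:>3} s")
```

Under this model, 30 iterations take 3300 s on MapReduce and 109 s on Spark, so the gap widens with every additional pass over the cached data.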

Cloudera Customer Use Cases

Core Spark
• Financial Services
  • Portfolio Risk Analysis
  • ETL Pipeline Speed-Up
  • 20+ years of stock data
• Health
  • Identify disease-causing genes in the full human genome
  • Calculate Jaccard scores on health care data sets
• ERP
  • Optical Character Recognition and Bill Classification
• Data Services
  • Trend analysis
  • Document classification (LDA)
  • Fraud analytics

Spark Streaming
• Financial Services
  • Online Fraud Detection
• Health
  • Incident Prediction for Sepsis
• Retail
  • Online Recommendation Systems
  • Real-Time Inventory Management
• Ad Tech
  • Real-Time Ad Performance Analysis

Uniting Spark and Hadoop
The One Platform Initiative

Investment Areas
• Management: Leverage Hadoop-native resource management.
• Security: Full support for Hadoop security and beyond.
• Scale: Enable 10k-node clusters.
• Streaming: Support for 80% of common stream processing workloads.

Detailed Roadmap: One Platform Initiative

[Legend on the slide: completed work versus planned future work. The color coding is not preserved in this text.]

Management
• Spark on YARN Integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation
• Spark on YARN:
  • Container resizing
  • Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters

Security
• Kerberos Integration
• HDFS Sync (Sentry)
• Secure data at rest
• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel’s advanced encryption libraries
• Enable column and view level security

Scale
• Revamp Scheduler handling of node failure
• Sort based shuffle improvements
• Task Scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads
• HDFS DDM Integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs

Streaming
• Zero Data Loss with Spark Streaming Resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (Business Users)
• New streaming specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates

The Bad News

Spark is a Developer Framework
• Spark means writing code
• And deploying it
• And monitoring it
• Workflow orchestration is hard
  • Oozie? Luigi?
  • Custom scripts

Data is Still Fickle
• Data Quality is still hard
  • Spark still can’t automatically find and clean bad records
  • Feature engineering = ETL
• Data Integration is still hard
  • Read / write the right formats
  • “Publish” to BI tools

Accelerating Real-Time Analytics with Spark
Yann Delacourt, Director of Big Data Product Management, Talend

A Modern Data Platform for All Your Integration Needs

• APPLICATION INTEGRATION
• CLOUD INTEGRATION
• DATA INTEGRATION
• BIG DATA INTEGRATION
• MASTER DATA MANAGEMENT

INTEGRATE ANYTHING. OPERATE IN REAL-TIME. ACT WITH INSIGHT.

1st Data Integration Platform on Apache Spark

[Diagram: a data fabric, with a Developer Studio and a Web UI, connecting big data, on-premise apps, cloud apps, IoT sensors, customers and suppliers.]

Introducing Talend Real-time Big Data
1st Data Integration Platform on Spark

• Visually develop jobs that run 100% on Spark
• 5X faster using independent benchmarks
• 10X developer productivity gained over hand-coding Spark
• 100X faster with in-memory processing
• Over 100 new drag-n-drop Spark components
• HDFS, RDBMS, NoSQL, Cloud Storage, Transformation, Messaging, In-memory analytics & machine learning recommendations, and much more
• In-memory data caching & “windowed” computations
• Click to enable Spark Streaming for real-time data processing
• Convert Talend MapReduce jobs to Spark with the click of a button, future-proofing your investment

Benefits: Make decisions faster. Tremendous developer productivity.

Enabling Intelligent Data Pipelining
Lambda Architecture: Batch, Real-time, Query

• A single solution to address
  • Bulk/batch
  • Real-time
  • Streaming & IoT data
  • Machine Learning
• Provides Fast Data access through NoSQL
• One tool for Hadoop, Spark, traditional ETL/ELT and NoSQL integration

[Diagram: sources (IoT, web logs, ERP, DBMS/EDW, legacy) feed a speed layer (incremental data, sliding window analytics, apply learning) and a batch layer (all data, learning on past data). The speed layer produces real-time views and the batch layer produces pre-computed views in NoSQL; the serving layer answers queries.]

Benefits: Developer productivity. Business agility.
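The division of labor between the batch, speed and serving layers can be sketched in a few lines of plain Python. This is an illustration of the general lambda architecture pattern, not of Talend's or Spark's implementation: the function names, the `(timestamp, page)` event shape and the sample events are all assumptions made for the example.

```python
from collections import Counter


def build_batch_view(events, cutoff):
    """Batch layer: pre-compute page counts from all data up to the cutoff timestamp."""
    return Counter(page for ts, page in events if ts <= cutoff)


def build_realtime_view(events, cutoff):
    """Speed layer: count only the incremental data that arrived after the cutoff."""
    return Counter(page for ts, page in events if ts > cutoff)


def query(page, batch_view, realtime_view):
    """Serving layer: merge the pre-computed view with the real-time view."""
    return batch_view[page] + realtime_view[page]


# Made-up click events as (timestamp, page) pairs.
events = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]
cutoff = 3  # the last batch run covered everything up to this timestamp

batch_view = build_batch_view(events, cutoff)
realtime_view = build_realtime_view(events, cutoff)
total_home = query("home", batch_view, realtime_view)

print(total_home)
```

Because the two views cover disjoint time ranges, the serving layer can add them without double counting; each batch run moves the cutoff forward and lets the speed layer discard what the batch view now covers.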

Easily Convert MapReduce to Spark!

[Diagram: one click converts a MapReduce job (runs on disk) into a Spark job (runs on disk and in-memory).]

Your Job Now 5X Faster

Spark/Talend Enabled Use Cases - Examples

Digital Economy
• Data Discovery (Interactive): Web Analytics
• Better Decisions (Batch): Click-Stream Analysis
• Real-Time Action (Streaming and Machine Learning): Real-Time Web Traffic Optimization (retargeting & reco)

Retail
• Data Discovery (Interactive): SCM Analytics
• Better Decisions (Batch): Find Purchase Correlation
• Real-Time Action (Streaming and Machine Learning): Real-Time Promotion & Coupon Optimization

Financial Services
• Data Discovery (Interactive): EDW
• Better Decisions (Batch): Fraud Detection Learning on Massive Data Volume
• Real-Time Action (Streaming and Machine Learning): Highly Scalable Trading, Risk Management & Real-Time Fraud Detection

Talend Success

Challenge:
• Ever-increasing Big Data velocity
• Many last-minute cart abandonments
• Hard to optimize pricing

Why Talend:
• Is the central integration tool within their Business Intelligence (BI) organization
• Integrates clickstreams from the last 6 months

Value:
• Leftover merchandise reduced by 20%
• Can predict abandoned shopping carts in real time with 90% accuracy
• Optimize pricing and stock pricing

Industrial Internet

Challenge:
• Needed to migrate 800 ETL jobs to an “Industrial Internet”
• Improve service levels by providing data and analytics in the cloud

Solution:
• Integrate big data, small data, and transactional data with high quality
• Talend Big Data, Data Quality, Master Data Management

Value:
• Provide a collaborative, prescriptive, and predictive environment
• Improved customer satisfaction, improved productivity per turbine
• Predict failures & reduce inventory
• Arm sales with competitive intelligence

From Zero to Big Data in 10 Minutes
Download free: www.talend.com/download

•  Get up and running in minutes, not weeks, with a big data Sandbox and demos

•  Includes: Sentiment analysis, ETL Offload, Log file analysis, Recommendation engine

•  Start working with Talend, Hadoop & NoSQL today!

Now with

The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco

Q&A