Spark 2.0: What’s Next Reynold Xin @rxin Spark Conference Japan Feb 8, 2016

Page 1: Spark 2.0: What’s Next - GitHub Pages

Spark 2.0: What’s Next

Reynold Xin @rxin
Spark Conference Japan
Feb 8, 2016

Page 2: Spark 2.0: What’s Next - GitHub Pages

Please put up your hand if you know what Spark is.

Page 3: Spark 2.0: What’s Next - GitHub Pages

Put up your hand if you think your significant other knows what Spark is.

(girlfriend, boyfriend, wife, husband, …)

Page 4: Spark 2.0: What’s Next - GitHub Pages

This Talk

What is Spark?
How are people using it?
Spark 2.0

Page 5: Spark 2.0: What’s Next - GitHub Pages

Open source data processing engine built around speed, ease of use, and sophisticated analytics

Page 6: Spark 2.0: What’s Next - GitHub Pages
Page 7: Spark 2.0: What’s Next - GitHub Pages
Page 8: Spark 2.0: What’s Next - GitHub Pages

About Databricks

Founded by creators of Spark & behind Spark development

Cloud Enterprise Spark Platform
• Cluster management, interactive notebooks, dashboards, production jobs, data governance, security, …


Page 9: Spark 2.0: What’s Next - GitHub Pages

2015: Great Year for Spark

Most active open source project in data (1000+ contributors)

New language: R

Widespread industry support & adoption


Page 10: Spark 2.0: What’s Next - GitHub Pages

Meetup Groups: December 2014

source: meetup.com

Page 11: Spark 2.0: What’s Next - GitHub Pages

Meetup Groups: December 2015

source: meetup.com

TokyoSpark Meetup

Page 12: Spark 2.0: What’s Next - GitHub Pages

(Japanese press headlines about IBM and Apache Spark, and about Spark versus Hadoop)

Page 13: Spark 2.0: What’s Next - GitHub Pages

“Spark is the Taylor Swift of big data software.”

- Derrick Harris, Fortune

Page 14: Spark 2.0: What’s Next - GitHub Pages

“Spark is the [Japanese name] of big data software.”

(A Japanese engineer told me)

Page 15: Spark 2.0: What’s Next - GitHub Pages

How are people using Spark?

Page 16: Spark 2.0: What’s Next - GitHub Pages

Diverse Runtime Environments

How respondents are running Spark
• 51% on a public cloud

Most common Spark deployment environments (cluster managers)
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%

Page 17: Spark 2.0: What’s Next - GitHub Pages

Industries Using Spark

• Other
• Software (SaaS, Web, Mobile)
• Consulting (IT)
• Retail, e-Commerce
• Advertising, Marketing, PR
• Banking, Finance
• Health, Medical, Pharmacy, Biotech
• Carriers, Telecommunications
• Education
• Computers, Hardware

Shares shown, in descending order (the pairing with the industries above is not preserved): 29.4%, 17.7%, 14.0%, 9.6%, 6.7%, 6.5%, 4.4%, 4.4%, 3.9%, 3.5%

Page 18: Spark 2.0: What’s Next - GitHub Pages

Top Applications

• Business Intelligence: 68%
• Data Warehousing: 52%
• Recommendation: 44%
• Log Processing: 40%
• User-Facing Services: 36%
• Fraud Detection / Security: 29%

Page 19: Spark 2.0: What’s Next - GitHub Pages

Are we done?

No. Development is faster than ever!

Page 20: Spark 2.0: What’s Next - GitHub Pages

2010: started @ Berkeley
2012: research paper
2013: Databricks started & Spark donated to ASF
2014: Spark 1.0 & libraries (SQL, ML, GraphX)
2015: DataFrames, Tungsten, ML Pipelines
2016: Spark 2.0

Page 21: Spark 2.0: What’s Next - GitHub Pages

Spark stack diagram

SQL, Streaming, MLlib, GraphX

Spark Core (RDD)

Page 22: Spark 2.0: What’s Next - GitHub Pages

Spark stack diagram (a different take)

Frontend (user-facing APIs)

Backend (execution)

Page 23: Spark 2.0: What’s Next - GitHub Pages

Spark stack diagram (a different take)

Frontend (RDD, DataFrame, ML pipelines, …)

Backend (scheduler, shuffle, operators, …)

Page 24: Spark 2.0: What’s Next - GitHub Pages

Spark 2.0

Frontend: API Foundation
• Streaming
• DataFrame/Dataset
• SQL

Backend: 10X Performance
• Whole-stage Codegen
• Vectorization
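The slide only names whole-stage code generation, so the following is a rough sketch of the general idea rather than Spark's implementation: a chain of operators that each pull rows from the previous one is replaced by a single fused loop. The operator names (`scan`, `filter_even`, `add_one`, `sum_rows`) are hypothetical, and the fused function is written by hand as a stand-in for what a code generator might emit.

```python
from typing import Iterable, Iterator


# Operator-at-a-time style: every operator is its own iterator,
# so each row passes through several generator hand-offs.
def scan(values: Iterable[int]) -> Iterator[int]:
    for v in values:
        yield v


def filter_even(rows: Iterable[int]) -> Iterator[int]:
    for v in rows:
        if v % 2 == 0:
            yield v


def add_one(rows: Iterable[int]) -> Iterator[int]:
    for v in rows:
        yield v + 1


def sum_rows(rows: Iterable[int]) -> int:
    total = 0
    for v in rows:
        total += v
    return total


def pipeline_iterators(values: Iterable[int]) -> int:
    """Chain of separate operators (hypothetical names)."""
    return sum_rows(add_one(filter_even(scan(values))))


def pipeline_fused(values: Iterable[int]) -> int:
    """Hand-fused equivalent: one loop, no per-operator hand-offs."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v + 1
    return total


data = list(range(10))
print(pipeline_iterators(data), pipeline_fused(data))
```

Both functions compute the same result; the fused version simply does it in one pass with no intermediate iterators, which is the property the backend work is after.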


Page 25: Spark 2.0: What’s Next - GitHub Pages

Guiding Principles for API Foundation

1. Simple yet expressive

2. (Semantics) well-defined

3. Sufficiently abstracted to allow optimized backends


Page 26: Spark 2.0: What’s Next - GitHub Pages

RDD
Java/Scala frontend → JVM backend

DataFrame
DataFrame frontend → Logical Plan → Catalyst optimizer → Physical execution
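To make the frontend, logical plan, optimizer, and execution split a little more concrete, here is a toy sketch. It is not Catalyst and does not use any Spark API: `Scan`, `Filter`, `optimize`, `execute`, and `depth` are hypothetical names, and the only rewrite rule shown is merging two adjacent filters into one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, int]


@dataclass
class Scan:
    """Leaf node: reads rows from an in-memory list (hypothetical source)."""
    rows: List[Row]


@dataclass
class Filter:
    """Keeps only the rows of `child` that satisfy `predicate`."""
    child: "Plan"
    predicate: Callable[[Row], bool]


Plan = Union[Scan, Filter]


def optimize(plan: Plan) -> Plan:
    """Toy optimizer rule: collapse adjacent Filter nodes into one."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            outer, inner = plan.predicate, child.predicate
            return Filter(child.child, lambda r: inner(r) and outer(r))
        return Filter(child, plan.predicate)
    return plan


def execute(plan: Plan) -> List[Row]:
    """Toy physical execution: walk the plan and produce rows."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    return [r for r in execute(plan.child) if plan.predicate(r)]


def depth(plan: Plan) -> int:
    """Number of nodes along the single-child chain."""
    return 1 + depth(plan.child) if isinstance(plan, Filter) else 1


rows = [{"age": a} for a in (15, 25, 35, 45)]
# The "frontend" only describes the computation as a tree.
plan = Filter(Filter(Scan(rows), lambda r: r["age"] > 20), lambda r: r["age"] < 40)
optimized = optimize(plan)
print(depth(plan), depth(optimized), execute(optimized))
```

Because the user-facing call only builds a description of the work, the optimizer is free to rewrite it before anything runs, which is the point of inserting a logical plan between the frontend and the backend.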

Page 27: Spark 2.0: What’s Next - GitHub Pages

Python, Java/Scala, SQL

DataFrame Logical Plan

JVM, Tungsten, …

Page 28: Spark 2.0: What’s Next - GitHub Pages

API Foundations in Spark 2.0

1. Streaming DataFrames

2. Maturing and merging DataFrame and Dataset

3. ANSI SQL
• natural join, subquery, view support

Page 29: Spark 2.0: What’s Next - GitHub Pages

Challenges with Stream Processing

Stream processing is hard to reason about
• Output over time
• Late data
• Failures
• Distribution

And all this has to work across complex operations
• Windows, sessions, aggregation, etc.
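One way to see why late data and windows interact: if events are assigned to windows by the timestamp they carry (event time) rather than by when they arrive, the result no longer depends on arrival order. The sketch below is plain Python under that assumption, not a Spark API; `tumbling_window_counts` and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (event_time_seconds, key)


def tumbling_window_counts(
    events: Iterable[Event], window_size: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key), using each event's own timestamp."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
    return dict(counts)


in_order = [(1, "a"), (4, "a"), (11, "a"), (12, "b")]
# Same events, but two of them arrive late (after newer events).
out_of_order = [(11, "a"), (1, "a"), (12, "b"), (4, "a")]

print(tumbling_window_counts(in_order, window_size=10))
print(tumbling_window_counts(out_of_order, window_size=10))
```

A real engine also has to decide how long to keep a window open for stragglers and when to emit output, which this sketch leaves out.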


Page 30: Spark 2.0: What’s Next - GitHub Pages

Next-gen Streaming with DataFrames

1. Easy-to-use APIs (batch, streaming, and interactive)

2. Well-defined semantics
• Out-of-order data
• Failures
• Sources/sinks with exactly-once semantics

3. Leverages Tungsten backend
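The slide does not say how exactly-once sources and sinks are achieved. One common approach, sketched here as an assumption rather than as Spark's design, is to pair a replayable source with a sink that remembers which batch ids it has already committed, so a replayed batch is ignored. `IdempotentSink` and the batch contents are hypothetical.

```python
from typing import Dict, List


class IdempotentSink:
    """Hypothetical sink that commits each batch id at most once."""

    def __init__(self) -> None:
        self.committed: Dict[int, List[str]] = {}

    def write(self, batch_id: int, rows: List[str]) -> bool:
        """Store the batch unless it was already committed; return True if stored."""
        if batch_id in self.committed:
            return False
        self.committed[batch_id] = list(rows)
        return True

    def all_rows(self) -> List[str]:
        return [row for b in sorted(self.committed) for row in self.committed[b]]


# A replayable source: the same batch id always yields the same rows.
batches = {0: ["a", "b"], 1: ["c"], 2: ["d"]}

sink = IdempotentSink()
for batch_id in (0, 1):
    sink.write(batch_id, batches[batch_id])

# Simulated failure: progress for batch 1 was not recorded, so the job
# restarts from batch 1 and replays it before moving on to batch 2.
for batch_id in (1, 2):
    sink.write(batch_id, batches[batch_id])

print(sink.all_rows())
```

Each row ends up in the output once even though batch 1 was processed twice, which is the observable guarantee that "exactly-once" refers to.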


Page 31: Spark 2.0: What’s Next - GitHub Pages

Next-gen Streaming with DataFrames

More details in the next few weeks.

Page 32: Spark 2.0: What’s Next - GitHub Pages

Spark is already pretty fast.

Can we make it 10X faster in 2.0?


Page 33: Spark 2.0: What’s Next - GitHub Pages

Teaser: SQL/DataFrame Performance

High throughput
• Spark 1.6: 13.95 million rows/sec
• Spark 2.0 (work in progress): 125 million rows/sec

Come to my talk this afternoon to learn more.
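As a quick check of how the two teaser numbers relate to the 10X goal, the ratio can be worked out directly. The figures are the ones on the slide; the benchmark behind them is not described here, so the variable names are just labels.

```python
# Throughput figures quoted on the slide, in rows per second.
spark_1_6_rows_per_sec = 13.95e6
spark_2_0_rows_per_sec = 125e6  # work-in-progress build

speedup = spark_2_0_rows_per_sec / spark_1_6_rows_per_sec
print(f"{speedup:.2f}x")  # prints 8.96x
```

That is roughly a 9x improvement on this one measurement, close to but not quite the 10X target.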


Page 34: Spark 2.0: What’s Next - GitHub Pages

Python, SQL, R, Streaming, Advanced Analytics

DataFrame (& Dataset)

Tungsten Execution

Page 35: Spark 2.0: What’s Next - GitHub Pages

Spark 2.0 Release Schedule

Under active development on GitHub

March – April: code freeze

April – May: official release


Page 36: Spark 2.0: What’s Next - GitHub Pages

Thank you!
@rxin