Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Burak Yavuz, Software Engineer, Databricks

Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Page 1: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™

Burak Yavuz
Software Engineer, Databricks

Page 2: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Burak Yavuz

● Software Engineer, Databricks: "We make your streams come true"
● Apache Spark Committer as of Feb 2017
● MS in Management Science & Engineering, Stanford University
● BS in Mechanical Engineering, Bogazici University, Istanbul

Page 3: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

About

TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009

PRODUCT: Unified Analytics Platform

MISSION: Making Big Data Simple

Page 4: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Outline

o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

Page 5: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

The simplest way to perform streaming analytics is not having to reason about streaming at all

Page 7: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

New Model

Input: data from source as an append-only table

Trigger: how frequently to check input for new data

Query: operations on input, the usual map/filter/reduce plus new window and session ops

[Figure: timeline with a trigger every 1 sec; at times 1, 2 and 3 the query runs over input data up to 1, up to 2 and up to 3]

Page 8: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

New Model

Result: final operated table, updated every trigger interval

Output: what part of the result to write to the data sink after every trigger

Complete output: write the full result table every time

[Figure: timeline with a trigger every 1 sec; at times 1, 2 and 3 the query turns input data up to 1, 2 and 3 into the result for data up to 1, 2 and 3. Output in complete mode: output all the rows in the result table]

Page 9: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

New Model

Result: final operated table, updated every trigger interval

Output: what part of the result to write to the data sink after every trigger

Complete output: write the full result table every time

Append output: write only the new rows that were added to the result table since the previous batch

*Not all output modes are feasible with all queries

[Figure: timeline with a trigger every 1 sec; at times 1, 2 and 3 the query turns input data up to 1, 2 and 3 into the result for data up to 1, 2 and 3. Output in append mode: output only new rows since last trigger]

Page 11: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Output Modes
▪ Append mode (default): New rows added to the Result Table since the last trigger will be output to the sink. Rows will be output only once, and cannot be rescinded.

Example use cases: ETL

Page 12: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Output Modes
▪ Complete mode: The whole Result Table will be output to the sink after every trigger. This is supported for aggregation queries.

Example use cases: Monitoring

Page 13: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Output Modes
▪ Update mode: (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be output to the sink.

Example use cases: Alerting, Sessionization
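
To make the difference between the three output modes concrete, here is a minimal plain-Scala sketch with no Spark dependency. It models the Result Table as a Map and computes what each mode would hand to the sink after one trigger. The object and method names are illustrative assumptions, not Spark APIs, and the append rule is simplified: it ignores the requirement that emitted rows never change afterwards.

```scala
// Plain-Scala model of output modes; a sketch, not Spark's implementation.
object OutputModesSketch {
  type ResultTable = Map[String, Long] // key -> aggregate value

  // Complete: the whole result table is written after every trigger.
  def complete(prev: ResultTable, curr: ResultTable): ResultTable = curr

  // Append (simplified): only rows whose key was absent at the previous trigger.
  def append(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (k, _) => !prev.contains(k) }

  // Update: rows that are new or whose value changed since the previous trigger.
  def update(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (k, v) => !prev.get(k).contains(v) }
}
```

With a previous table of a=1, b=2 and a current table of a=1, b=3, c=1, complete mode writes all three rows, this simplified append writes only c, and update writes b and c.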

Page 14: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Outline

o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

Page 15: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Event time Aggregations

Many use cases require aggregate statistics by event time
E.g. what's the #errors in each system in 1 hour windows?

Many challenges
Extracting event time from data, handling late, out-of-order data

DStream APIs were insufficient for event time operations

Page 16: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Event time Aggregations

Windowing is just another type of grouping in Structured Streaming

    // number of records every hour
    parsedData
      .groupBy(window($"timestamp", "1 hour"))
      .count()

    // avg signal strength of each device every 10 mins
    parsedData
      .groupBy($"device", window($"timestamp", "10 minutes"))
      .avg("signal")

Use built-in functions to extract event time. No need for separate extractors.
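
As a rough illustration of what grouping by a tumbling window means, the plain-Scala sketch below (no Spark dependency) assigns each event time to the start of its window and counts events per window. It assumes epoch-millisecond timestamps and windows aligned to the epoch, and it ignores sliding windows and time zones. The names `WindowSketch`, `windowStart` and `countPerWindow` are hypothetical.

```scala
// Plain-Scala model of tumbling-window grouping; a sketch, not Spark's implementation.
object WindowSketch {
  // Start (epoch millis) of the tumbling window that contains the event.
  def windowStart(eventTimeMs: Long, windowMs: Long): Long =
    eventTimeMs - Math.floorMod(eventTimeMs, windowMs)

  // Count events per [start, start + windowMs) window, keyed by window start.
  def countPerWindow(eventTimesMs: Seq[Long], windowMs: Long): Map[Long, Int] =
    eventTimesMs
      .groupBy(windowStart(_, windowMs))
      .map { case (start, events) => start -> events.size }
}
```

The window start plays the role of an ordinary grouping key, which is why windowing can be expressed as just another column in `groupBy`.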

Page 17: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Advanced Aggregations

Powerful built-in aggregations
variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...

Multiple simultaneous aggregations

    parsedData
      .groupBy(window($"timestamp", "1 hour"))
      .agg(avg("signal"), stddev("signal"), max("signal"))

Custom aggs using reduceGroups, UDAFs

    // Compute a histogram of signal per device type.
    // `type` is a reserved word in Scala, so the field is assumed to be named deviceType.
    // Assumes signal is an Int in [0, 100), giving 10 buckets of width 10.
    val hist = ds.groupByKey(_.deviceType).mapGroups {
      case (deviceType, data: Iterator[DeviceData]) =>
        val buckets = new Array[Int](10)
        data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
        (deviceType, buckets)
    }

Page 18: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Stateful Processing for Aggregations

In-memory, streaming state maintained for aggregations

[Figure: window counts held in state at each point on the timeline (13:00, 14:00, 15:00, 16:00, 17:00); red = state updated with late data]

13:00   12:00 - 13:00: 1
14:00   12:00 - 13:00: 3, 13:00 - 14:00: 1
15:00   12:00 - 13:00: 3, 13:00 - 14:00: 2, 14:00 - 15:00: 5
16:00   12:00 - 13:00: 5, 13:00 - 14:00: 2, 14:00 - 15:00: 5, 15:00 - 16:00: 4
17:00   12:00 - 13:00: 3, 13:00 - 14:00: 2, 14:00 - 15:00: 6, 15:00 - 16:00: 4, 16:00 - 17:00: 3

Keeping state allows late data to update counts of old windows

But the size of the state increases indefinitely if old windows are not dropped

Page 20: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Watermarking and Late Data

Watermark [Spark 2.1]: a moving threshold that trails behind the max seen event time

Trailing gap defines how late data is expected to be

[Figure: event-time axis with the max event time at 12:30 PM and the watermark at 12:20 PM, a trailing gap of 10 mins; data older than the watermark is not expected]

Page 21: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Watermarking and Late Data

Data newer than the watermark may be late, but is allowed to aggregate

Data older than the watermark is "too late" and dropped

State older than the watermark is automatically deleted to limit the amount of intermediate state

[Figure: event-time axis with the max event time and the watermark below it; late data above the watermark is allowed to aggregate, data below it is too late and dropped]

Page 22: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Watermarking and Late Data

Control the tradeoff between state size and lateness requirements

Handle more late data → keep more state
Reduce state → handle less lateness

    parsedData
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

[Figure: event-time axis with the max event time and the watermark trailing it by an allowed lateness of 10 mins; late data above the watermark is allowed to aggregate, data below it is too late and dropped]

Page 23: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Watermarking to Limit State [Spark 2.1]

    parsedData
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

[Figure: event time (12:04 to 12:18) plotted against processing time (triggers at 12:00, 12:05, 12:10, 12:15). Annotations: the system tracks the max observed event time (12:14); the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state < 12:04 is deleted; data that is late but newer than the watermark (e.g. 12:08, with wm = 12:04) is considered in counts; data that is too late is ignored in counts and its state is dropped]

More details in blog post!
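
The watermark arithmetic on this slide can be sketched in plain Scala with no Spark dependency. The model below works in minutes since midnight, accepts events at or after the current watermark, drops older ones, and computes the next watermark as the max seen event time minus the gap. It is a simplified per-event model under stated assumptions: Spark itself applies the watermark to window state, and every name here (`WatermarkSketch`, `processBatch`, `hm`) is hypothetical.

```scala
// Plain-Scala model of a trailing watermark; a sketch, not Spark's implementation.
object WatermarkSketch {
  final case class Result(
      accepted: Seq[Long],   // events at or after the current watermark
      dropped: Seq[Long],    // events older than the current watermark
      maxSeenMin: Long,      // max event time observed so far
      nextWatermarkMin: Long // watermark to use for the next trigger
  )

  // Helper: hours and minutes to minutes since midnight.
  def hm(hours: Int, minutes: Int): Long = hours * 60L + minutes

  // Process one trigger's batch of event times (all in minutes).
  def processBatch(watermarkMin: Long, maxSeenMin: Long, batch: Seq[Long], gapMin: Long): Result = {
    val (accepted, dropped) = batch.partition(_ >= watermarkMin)
    val newMax = (maxSeenMin +: batch).max
    Result(accepted, dropped, newMax, newMax - gapMin)
  }
}
```

For example, a first batch whose latest event is 12:14 with a 10 minute gap yields a next watermark of 12:04. In the following trigger an event at 12:08 is late but still accepted, while an event at 12:03 is dropped.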

Page 25: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Working With Time

    df.withWatermark("timestampColumn", "5 hours")
      .groupBy(window($"timestampColumn", "1 minute"))
      .count()
      .writeStream
      .trigger(Trigger.ProcessingTime("10 seconds"))

Separate processing details (output rate, late data tolerance) from query semantics.

Page 26: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Working With Time

    .groupBy(window($"timestampColumn", "1 minute"))

How to group data by time

Same in streaming & batch

Page 27: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Working With Time

    .withWatermark("timestampColumn", "5 hours")

How late data can be

Page 28: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Working With Time

    .writeStream.trigger(Trigger.ProcessingTime("10 seconds"))

How often to emit updates

Page 29: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState allows any user-defined stateful ops to a user-defined state

Direct support for per-key timeouts in event-time or processing-time

Supports Scala and Java

    ds.groupByKey(groupingFunc)
      .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

    def mappingWithStateFunc(key: K, values: Iterator[V], state: GroupState[S]): U = {
      // update or remove state
      // set timeouts
      // return mapped value
    }

Page 30: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

flatMapGroupsWithState
▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
▪ Invoked once per group in batch
▪ Invoked each trigger (with the existence of data) per group in streaming
▪ Requires user to provide an output mode for the function

Page 31: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

flatMapGroupsWithState
▪ mapGroupsWithState is a special case with
  o Output mode: Update
  o Output size: 1 row per group
▪ Supports both Processing Time and Event Time timeouts
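
The calling pattern described above (one invocation per key that has data in a trigger, with the key's state threaded through) can be modelled in plain Scala without Spark. In this sketch, `Option[S]` stands in for `GroupState[S]`: returning `Some` updates the state and `None` removes it. Timeouts are not modelled, and the names `StatefulSketch`, `runTrigger` and `runningCount` are illustrative assumptions rather than Spark APIs.

```scala
// Plain-Scala model of per-key stateful processing; a sketch, not Spark's implementation.
object StatefulSketch {
  // One trigger: call func once per key that has data, threading per-key state through.
  // Returns the updated state store and one output row per invoked key.
  def runTrigger[K, V, S, U](
      store: Map[K, S],
      batch: Seq[(K, V)]
  )(func: (K, Iterator[V], Option[S]) => (Option[S], U)): (Map[K, S], Map[K, U]) =
    batch.groupBy(_._1).foldLeft((store, Map.empty[K, U])) {
      case ((st, out), (key, rows)) =>
        val (newState, result) = func(key, rows.iterator.map(_._2), st.get(key))
        val nextStore = newState match {
          case Some(s) => st.updated(key, s) // update state
          case None    => st - key            // remove state
        }
        (nextStore, out.updated(key, result))
    }

  // Example user function: running count of events per key.
  def runningCount(key: String, values: Iterator[Int], state: Option[Long]): (Option[Long], Long) = {
    val total = state.getOrElse(0L) + values.size
    (Some(total), total)
  }
}
```

Keys with no data in a trigger are not invoked, so they produce no output row while their state is carried over unchanged. This mirrors the "1 row per group" shape of mapGroupsWithState for groups that received data.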

Page 32: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Outline

o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

Page 33: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Alerting

    val monitoring = stream.as[Event]
      .groupByKey(_.id)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
        (id: Int, events: Iterator[Event], state: GroupState[…]) =>
          ...
      }
      .writeStream
      .queryName("alerts")
      .foreach(new PagerdutySink(credentials))

Monitor a stream using custom stateful logic with timeouts.

Page 34: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Alerting
▪ Save your state to Scylla to power dashboards
▪ Have the stream trigger alerts ASAP

Page 35: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Sessionization

    val monitoring = stream.as[Event]
      .groupByKey(_.session_id)
      .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
        (id: Int, events: Iterator[Event], state: GroupState[…]) =>
          ...
      }
      .writeStream
      .scylla("trips")

Analyze sessions of user/system behavior

Page 36: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Sessionization
▪ Update sessions in your stream
▪ Save it to a NoSQL store like Scylla!
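
A minimal plain-Scala sketch of the session bookkeeping behind this use case, with no Spark or Scylla dependency. It folds event times into a per-session state and splits sessions into open and timed out for a given watermark, which roughly corresponds to an event-time timeout. The `Session` case class, the gap parameter and all function names are assumptions made for illustration.

```scala
// Plain-Scala model of sessionization state; a sketch, not Spark's implementation.
object SessionSketch {
  final case class Session(startMs: Long, lastMs: Long, events: Int)

  // Fold a batch of event times for one session id into its state.
  def update(state: Option[Session], eventTimesMs: Seq[Long]): Option[Session] =
    eventTimesMs.foldLeft(state) {
      case (Some(s), t) => Some(Session(math.min(s.startMs, t), math.max(s.lastMs, t), s.events + 1))
      case (None, t)    => Some(Session(t, t, 1))
    }

  // Split sessions into (still open, timed out) for a given watermark:
  // a session times out once the watermark reaches its last event plus the gap.
  def expire[K](store: Map[K, Session], watermarkMs: Long, gapMs: Long): (Map[K, Session], Map[K, Session]) =
    store.partition { case (_, s) => s.lastMs + gapMs > watermarkMs }
}
```

In a real pipeline the open sessions would stay in streaming state, while updated or timed-out sessions are the rows one would write out to a store such as Scylla.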

Page 37: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Demo

Page 38: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Try Spark 2.2 on Community Edition today!

https://databricks.com/try-databricks

Page 39: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Apache Spark’s Structured Streaming at Scale Series

https://databricks.com/blog/category/engineering

Twitter: @databricks

Page 40: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

We are hiring!

https://databricks.com/company/careers

Page 41: Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

THANK  YOU

[email protected]

“Does anyone have any questions for my answers?” - Henry Kissinger