Scylla Summit 2017: Stateful Streaming Applications with Apache Spark


Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™

Burak Yavuz
Software Engineer, Databricks



Burak Yavuz

● Software Engineer – Databricks - “We make your streams come true”
● Apache Spark Committer as of Feb 2017
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Istanbul



About

TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009

PRODUCT: Unified Analytics Platform

MISSION: Making Big Data Simple



Outline

o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos



The simplest way to perform streaming analytics is not having to reason about streaming at all





New Model

Input: data from source as an append-only table

Trigger: how frequently to check input for new data

Query: operations on input: usual map/filter/reduce, plus new window and session ops

[Figure: timeline with a trigger every 1 sec. At times 1, 2, and 3 the query runs over the input table, which holds data up to 1, up to 2, and up to 3.]




New Model

Result: final operated table updated every trigger interval

Output: what part of result to write to data sink after every trigger

Complete output: Write full result table every time

[Figure: at times 1, 2, and 3 the query turns the input (data up to 1, up to 2, up to 3) into a result table. Output (complete mode): output all the rows in the result table.]




New Model

Append output: Write only new rows that got added to result table since previous batch

*Not all output modes are feasible with all queries

[Figure: at times 1, 2, and 3 the query turns the input (data up to 1, up to 2, up to 3) into a result table. Output (append mode): output only new rows since last trigger.]





Output Modes

▪ Append mode (default) - New rows added to the Result Table since the last trigger are written to the sink. Rows are output only once and cannot be rescinded.

Example use cases: ETL



Output Modes

▪ Complete mode - The whole Result Table is written to the sink after every trigger. This is supported for aggregation queries.

Example use cases: Monitoring



Output Modes

▪ Update mode (available since Spark 2.1.1) - Only the rows in the Result Table that were updated since the last trigger are written to the sink.

Example use cases: Alerting, Sessionization
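
The three output modes can be illustrated with a small model in plain Scala. This is a sketch of the semantics described on the slides, not Spark's implementation: the result table is modeled as a Map from key to aggregate value, and the names OutputModes, ResultTable, complete, append, and update are hypothetical. Append is modeled exactly as defined above (rows that are new since the last trigger), which assumes that rows already written never change afterwards.

```scala
object OutputModes {
  // Hypothetical model of a result table: key -> aggregate value.
  type ResultTable = Map[String, Long]

  // Complete mode: write every row of the current result table.
  def complete(prev: ResultTable, curr: ResultTable): ResultTable = curr

  // Append mode: write only rows whose key did not exist at the last trigger.
  def append(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (key, _) => !prev.contains(key) }

  // Update mode: write rows that are new or whose value changed since the last trigger.
  def update(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (key, value) => !prev.get(key).contains(value) }
}
```

With a previous table of a=1, b=2 and a current table of a=1, b=3, c=1, complete writes all three rows, append writes only c, and update writes b and c.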







Event time Aggregations

Many use cases require aggregate statistics by event time
E.g. what's the number of errors in each system in 1 hour windows?

Many challenges
Extracting event time from data, handling late, out-of-order data

DStream APIs were insufficient for event time operations



Event time Aggregations

Windowing is just another type of grouping in Structured Streaming

number of records every hour

// assumes import org.apache.spark.sql.functions.window and import spark.implicits._ (for $"...")
parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

avg signal strength of each device every 10 mins

parsedData
  .groupBy($"device", window($"timestamp", "10 minutes"))
  .avg("signal")

Use built-in functions to extract event-time
No need for separate extractors
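
To see why windowing is "just another type of grouping", here is a minimal sketch in plain Scala of how an event time can be mapped to a tumbling window, which then serves as an ordinary grouping key. It is an illustration under simplifying assumptions, not Spark's window implementation: times are plain Long values in an arbitrary unit, only non-overlapping (tumbling) windows are handled, and TumblingWindow, windowFor, and countPerWindow are hypothetical names.

```scala
object TumblingWindow {
  /** Start (inclusive) and end (exclusive) of the tumbling window that contains eventTime. */
  def windowFor(eventTime: Long, windowSize: Long): (Long, Long) = {
    // floorMod keeps the result non-negative, so negative times also land in the right window.
    val start = eventTime - java.lang.Math.floorMod(eventTime, windowSize)
    (start, start + windowSize)
  }

  /** Count events per window: the window is used as a plain grouping key. */
  def countPerWindow(eventTimes: Seq[Long], windowSize: Long): Map[(Long, Long), Int] =
    eventTimes
      .groupBy(t => windowFor(t, windowSize))
      .map { case (window, times) => window -> times.size }
}
```

For example, with a window size of 5, the event times 1, 4, 5, and 12 fall into the windows [0, 5), [0, 5), [5, 10), and [10, 15).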



Advanced Aggregations

Powerful built-in aggregations

Multiple simultaneous aggregations

parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .agg(avg("signal"), stddev("signal"), max("signal"))

variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...

Custom aggs using reduceGroups, UDAFs

// Compute a histogram of signal by device type.
// Assumes signal is an Int between 0 and 99, so that signal / 10 indexes one of 10 buckets.
val hist = ds.groupByKey(_.`type`).mapGroups {
  (deviceType, data) =>
    val buckets = new Array[Int](10)
    data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
    (deviceType, buckets)
}



Stateful Processing for Aggregations

In-memory, streaming state maintained for aggregations

State after each trigger (rows are 1-hour windows, values are counts):

Window           13:00  14:00  15:00  16:00  17:00
12:00 - 13:00      1      3      3      5      3
13:00 - 14:00             1      2      2      2
14:00 - 15:00                    5      5      6
15:00 - 16:00                           4      4
16:00 - 17:00                                  3

(On the original slide, state updated with late data is highlighted in red.)

Keeping state allows late data to update counts of old windows

But size of the state increases indefinitely if old windows not dropped





Watermarking and Late Data

Watermark [Spark 2.1] - a moving threshold that trails behind the max seen event time

Trailing gap defines how late data is expected to be

[Figure: event-time axis with the max event time at 12:30 PM and the watermark at 12:20 PM, a trailing gap of 10 mins. Data older than the watermark is not expected.]




Data newer than watermark may be late, but allowed to aggregate

Data older than watermark is "too late" and dropped

State older than watermark automatically deleted to limit the amount of intermediate state

[Figure: event-time axis showing the max event time and the watermark. Late data newer than the watermark is allowed to aggregate; data older than the watermark is too late and dropped.]




Control the tradeoff between state size and lateness requirements

Handle more late data → keep more state
Reduce state → handle less lateness

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

[Figure: event-time axis showing the max event time and the watermark, with an allowed lateness of 10 mins. Late data newer than the watermark is allowed to aggregate; data older than the watermark is too late and dropped.]



Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

[Figure: event time plotted against processing time, with triggers at 12:00, 12:05, 12:10, and 12:15. Annotations:]

System tracks max observed event time

Watermark updated to 12:14 - 10m = 12:04 for next trigger, state < 12:04 deleted

Data is late, but considered in counts

Data too late, ignored in counts, state dropped

More details in blog post!
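
The watermark rules above can be traced with a small simulation in plain Scala. This is a simplified sketch, not Spark's implementation: event times are minutes since midnight, lateness is decided per event against the watermark computed at the end of the previous trigger, and the names WatermarkSketch, State, hm, and trigger are hypothetical.

```scala
object WatermarkSketch {
  /** Minutes since midnight, so that 12:04 is hm(12, 4). */
  def hm(hour: Int, minute: Int): Int = hour * 60 + minute

  /** State carried from one trigger to the next. */
  final case class State(
      maxEventTime: Int,      // largest event time observed so far
      watermark: Int,         // threshold applied to the next trigger's data
      counts: Map[Int, Int])  // window start -> number of events in that window

  val empty: State = State(maxEventTime = 0, watermark = 0, counts = Map.empty)

  def trigger(state: State, batch: Seq[Int], delay: Int, window: Int): State = {
    // Data older than the current watermark is too late and is dropped.
    val accepted = batch.filter(_ >= state.watermark)

    // Late but accepted data still updates the count of its old window.
    val counts = accepted.foldLeft(state.counts) { (acc, t) =>
      val start = t - t % window
      acc.updated(start, acc.getOrElse(start, 0) + 1)
    }

    // Track the max event time and advance the watermark for the next trigger.
    val maxEventTime = (state.maxEventTime +: batch).max
    val watermark = math.max(state.watermark, maxEventTime - delay)

    // Delete state for windows that end at or before the new watermark.
    val kept = counts.filter { case (start, _) => start + window > watermark }
    State(maxEventTime, watermark, kept)
  }
}
```

Feeding a first batch with event times 12:07, 12:08, and 12:14, a delay of 10 minutes, and 5-minute windows moves the watermark to 12:14 - 10m = 12:04, matching the slide. An event at 12:08 arriving in the next trigger is late but still counted, while an event at 12:04 arriving after the watermark has passed 12:04 is dropped.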





Working With Time

df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window($"timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))

Separate processing details (output rate, late data tolerance) from query semantics.



How to group data by time: window(...). Same in streaming & batch
How late data can be: withWatermark(...)
How often to emit updates: trigger(...)



Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState allows any user-defined stateful ops to a user-defined state

Direct support for per-key timeouts in event-time or processing-time

supports Scala and Java

ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(key: K, values: Iterator[V], state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}



flatMapGroupsWithState

▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
▪ Invoked once per group in batch
▪ Invoked each trigger (with the existence of data) per group in streaming
▪ Requires user to provide an output mode for the function



flatMapGroupsWithState

▪ mapGroupsWithState is a special case with
  o Output mode: Update
  o Output size: 1 row per group
▪ Supports both Processing Time and Event Time timeouts
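
The per-group invocation and timeout behavior described above can be modeled in plain Scala. This is a sketch of the semantics only and deliberately does not use Spark's GroupState API: it assumes a processing-time timeout, treats each event as just its key, and emits one row per touched group (as in the mapGroupsWithState special case). StatefulOpSketch, SessionState, SessionUpdate, and trigger are hypothetical names.

```scala
object StatefulOpSketch {
  /** Hypothetical per-key state: events seen so far and the processing-time deadline. */
  final case class SessionState(events: Int, timeoutAtMs: Long)

  /** One output row per group that was updated or timed out in a trigger. */
  final case class SessionUpdate(id: String, events: Int, expired: Boolean)

  /** Runs one trigger: groups with new data are updated, idle groups past their deadline expire. */
  def trigger(
      states: Map[String, SessionState],
      batch: Seq[String],  // keys of the events that arrived since the last trigger
      nowMs: Long,
      timeoutMs: Long): (Map[String, SessionState], Seq[SessionUpdate]) = {
    val newCounts: Map[String, Int] =
      batch.groupBy(identity).map { case (id, events) => id -> events.size }

    // Groups with data: update the state and push the timeout forward.
    val updated: Map[String, SessionState] = newCounts.map { case (id, n) =>
      val total = states.get(id).map(_.events).getOrElse(0) + n
      id -> SessionState(total, nowMs + timeoutMs)
    }

    // Groups without data whose deadline has passed: emit a final row and remove the state.
    val timedOut = states.filter { case (id, s) =>
      !newCounts.contains(id) && s.timeoutAtMs <= nowMs
    }

    val output =
      updated.toSeq.map { case (id, s) => SessionUpdate(id, s.events, expired = false) } ++
        timedOut.toSeq.map { case (id, s) => SessionUpdate(id, s.events, expired = true) }

    ((states -- timedOut.keys) ++ updated, output.sortBy(_.id))
  }
}
```

Starting from empty state with a 10-second timeout, a batch of keys a, a, b at time 0 creates two groups. A batch containing only a at 5 seconds updates a and leaves b untouched. An empty batch at 12 seconds expires b, whose deadline was 10 seconds, and keeps a, whose deadline moved to 15 seconds.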







Alerting

val monitoring = stream.as[Event]
  .groupByKey(_.id)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))

Monitor a stream using custom stateful logic with timeouts.



Alerting

▪ Save your state to Scylla to power dashboards
▪ Have the stream trigger alerts ASAP



Sessionization

val monitoring = stream.as[Event]
  .groupByKey(_.session_id)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .scylla("trips")  // slide shorthand for writing to a Scylla sink; not a built-in DataStreamWriter method

Analyze sessions of user/system behavior



Sessionization

▪ Update sessions in your stream
▪ Save it to a NoSQL store like Scylla!



Demo



Try Spark 2.2 on Community Edition today!

https://databricks.com/try-databricks



Apache Spark’s Structured Streaming at Scale Series

https://databricks.com/blog/category/engineering

Twitter: @databricks



We are hiring!

https://databricks.com/company/careers



THANK YOU


“Does anyone have any questions for my answers?” - Henry Kissinger
