WINDOWING DATA IN BIG DATA STREAMS ADAM WARSKI, WOLVESSUMMIT

ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI

BIG DATA? FAST DATA?

▸ What is big data?

▸ Shift of focus

▸ Processing speed

▸ Fast data -> streaming

WHAT IS STREAMING?

“A type of data processing engine that is designed with infinite data sets in mind”

Tyler Akidau, Google

WINDOWING

▸ Time becomes the focus point

▸ How many invalid password errors were there in the last 5 minutes?

▸ During which 30-minute window did we get most traffic?

▸ What’s the average 5-minute speed on a section of a highway throughout the day?
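
The first question above can be sketched in a few lines of plain Python. This is a minimal illustration, not taken from any of the frameworks discussed later: the `RecentCounter` class and its method names are hypothetical, and it assumes timestamps (in seconds) arrive in order.

```python
from collections import deque


class RecentCounter:
    """Counts events whose timestamp falls within the last `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.timestamps = deque()  # event timestamps, oldest first

    def record(self, ts):
        # Assumption: timestamps are recorded in non-decreasing order.
        self.timestamps.append(ts)

    def count(self, now):
        # Evict everything that has fallen out of the window ending at `now`.
        while self.timestamps and self.timestamps[0] <= now - self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps)


errors = RecentCounter(window_s=5 * 60)
for ts in [0, 100, 250, 400]:  # seconds at which invalid-password errors occurred
    errors.record(ts)

errors_in_last_5_min = errors.count(now=420)
```

Only the timestamps still inside the window are kept, so memory is bounded by the number of events per window rather than by the length of the stream.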

HOW TO DO STREAMING? WITH WINDOWS?

▸ Many possibilities:

▸ Spark Streaming

▸ Spark Structured Streaming

▸ Kafka Streams

▸ Flink

▸ Akka Streams

▸ …

WHICH ONE TO CHOOSE?

LET’S FIND OUT

/ME

▸ coder @

▸ Lightbend, Confluent, Datastax consulting partner

▸ mainly Scala

▸ open-source: MacWire, ElasticMQ, Quicklens, …

▸ http://www.warski.org / @adamwarski

WHAT’S THE TIME?

▸ How to associate time with an event:

▸ event time: “logical”, data-dependent

▸ ingestion time: when the event entered the system

▸ processing time: when the event is being processed
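
The three notions of time can be made concrete with a small sketch. The `Event` type, the `"ts"` payload field, and the `timestamp_of` helper are assumptions made for illustration only.

```python
import time
from dataclasses import dataclass


@dataclass
class Event:
    payload: dict          # assumed to carry its own "ts" field (event time)
    ingestion_time: float  # stamped once, when the event entered the system


def timestamp_of(event, characteristic):
    """Pick the timestamp to window by, depending on the chosen notion of time."""
    if characteristic == "event":
        return event.payload["ts"]   # logical, data-dependent
    if characteristic == "ingestion":
        return event.ingestion_time  # fixed at the system boundary
    if characteristic == "processing":
        return time.time()           # wall clock at the moment of processing
    raise ValueError(f"unknown time characteristic: {characteristic}")


login = Event(payload={"ts": 10.0, "user": "alice"}, ingestion_time=12.5)
```

Only event time gives the same answer when the stream is replayed; processing time changes on every run.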

TYPES OF WINDOWS

▸ Time-based

▸ fixed/tumbling

▸ sliding

▸ Session-based
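
As a rough sketch of how the three window types differ, the functions below assign timestamps to windows. The function names are hypothetical, windows are half-open intervals `[start, end)`, and timestamps are plain integers.

```python
def tumbling_window(ts, size):
    """Return the single fixed window containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, step):
    """Return every window of length `size`, starting at a multiple of `step`, that contains ts."""
    windows = []
    start = ts - ts % step  # the latest window start that is <= ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= step
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a pause longer than `gap` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions
```

For example, `tumbling_window(7, size=5)` gives `(5, 10)`, while `sliding_windows(7, size=10, step=5)` gives `[(0, 10), (5, 15)]`: with sliding windows a single element belongs to several windows. Session windows have no fixed size; their boundaries depend on the data.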

OUT-OF-ORDER: WATERMARKS, LATENESS

▸ Windows GC

▸ At some point, enough is enough

▸ Watermark:

▸ all events before X have been observed

▸ heuristics
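
One common heuristic is to assume that events are never more than a fixed amount of time out of order. The sketch below illustrates that idea; the class name and the `max_delay` parameter are assumptions, and real engines offer more elaborate strategies.

```python
class BoundedDelayWatermark:
    """Heuristic watermark: highest event time seen so far, minus a fixed allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_seen = float("-inf")

    def current(self):
        # Claim: all events with a timestamp before this value have been observed.
        return self.max_seen - self.max_delay

    def observe(self, event_time):
        """Return True if the event is on time, False if it is behind the watermark (late)."""
        on_time = event_time >= self.current()
        self.max_seen = max(self.max_seen, event_time)
        return on_time


watermark = BoundedDelayWatermark(max_delay=5)
on_time_flags = [watermark.observe(t) for t in [10, 12, 8, 20, 13]]
```

The event at time 8 is out of order but still accepted, because the watermark is only at 7 when it arrives. The event at time 13 arrives after the watermark has advanced to 15, so it is late: its window may already have been garbage-collected.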

TRIGGERS

▸ When to emit window results

▸ Watermark progress

▸ Event time progress

▸ Processing time progress

▸ Punctuations
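
To illustrate how the choice of trigger changes what is emitted, here is a simplified sketch for a single window `[0, window_end)` that sums values. All names are hypothetical, and the watermark is taken to be simply the highest event time seen so far.

```python
def window_emissions(events, window_end, should_fire):
    """Feed (event_time, value) pairs, in arrival order, into one window.

    Returns the partial sums emitted each time the trigger fires.
    """
    total, count = 0, 0
    watermark = float("-inf")
    emitted = []
    for event_time, value in events:
        watermark = max(watermark, event_time)  # simplistic watermark
        in_window = event_time < window_end
        if in_window:
            total += value
            count += 1
        if should_fire(in_window, count, watermark, window_end):
            emitted.append(total)
        if watermark >= window_end:
            break  # the window is considered complete
    return emitted


def every_n_elements(n):
    # Count-based trigger: fire after every n-th element added to the window.
    return lambda in_window, count, watermark, window_end: in_window and count % n == 0


def on_watermark():
    # Watermark-based trigger: fire once, when the watermark passes the window end.
    return lambda in_window, count, watermark, window_end: watermark >= window_end


events = [(1, 10), (3, 20), (2, 5), (7, 1), (12, 100)]
early_results = window_emissions(events, window_end=10, should_fire=every_n_elements(2))
final_result = window_emissions(events, window_end=10, should_fire=on_watermark())
```

The count trigger yields early, partial results (`[30, 36]`), while the watermark trigger yields a single result (`[36]`) once the window is believed to be complete.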

ACCUMULATION OF RESULTS

▸ If we trigger many times …

▸ discard

▸ accumulate

▸ retract & accumulate
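
The three modes can be compared on a small example. In this sketch (function and mode names are made up for illustration) each inner list holds the values that arrived for one window between two consecutive trigger firings, and the window computes a sum.

```python
def emitted_panes(firings, mode):
    """Return what is sent downstream at each trigger firing, for a summing window."""
    running = 0      # sum of everything seen so far in this window
    previous = None  # total emitted at the previous firing, if any
    out = []
    for new_values in firings:
        delta = sum(new_values)
        running += delta
        if mode == "discard":
            out.append(delta)                # only what is new since the last firing
        elif mode == "accumulate":
            out.append(running)              # the full, updated result
        elif mode == "retract_and_accumulate":
            out.append((previous, running))  # (value to retract, new result)
            previous = running
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


firings = [[3, 4], [5], [2, 1]]
discarded = emitted_panes(firings, "discard")
accumulated = emitted_panes(firings, "accumulate")
retracted = emitted_panes(firings, "retract_and_accumulate")
```

Discarding gives `[7, 5, 3]`, so the consumer has to add the panes up itself. Accumulating gives `[7, 12, 15]`, so the consumer can overwrite the previous value. Retracting gives `[(None, 7), (7, 12), (12, 15)]`, which lets a downstream aggregation undo the old value before applying the new one.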

FINALLY … HOW TO MANIPULATE THE DATA

▸ map, flatMap, filter …

▸ stateful computation

▸ fold, reduce

▸ past-dependent operations

▸ where to store the state
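
A stateful, past-dependent operation can be sketched as a per-key fold whose state lives in a store passed in from outside. The `keyed_fold` name is hypothetical, and the store here is a plain dict; in a real system it could be any dict-like object backed by local disk or a cluster.

```python
def keyed_fold(events, zero, combine, store):
    """Fold values per key, keeping the running result in `store` and emitting every update."""
    for key, value in events:
        store[key] = combine(store.get(key, zero), value)
        yield key, store[key]


store = {}  # assumption: an in-memory dict stands in for the state store
events = [("a", 1), ("b", 10), ("a", 2)]
updates = list(keyed_fold(events, zero=0, combine=lambda acc, v: acc + v, store=store))
```

Unlike `map` or `filter`, the result for each element depends on everything seen before it for the same key, which is why the location and durability of `store` becomes a design decision.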

SUMMING UP

▸ Event/ingestion/processing time

▸ Tumbling/sliding/session windows

▸ Watermarks

▸ Triggers

▸ Accumulation of results

▸ State management

SPARK STREAMING

▸ Micro-batches (DStream)

▸ .window() API:

▸ tumbling/sliding windows

▸ only processing time

▸ no watermarks

▸ triggers at the end of the window

▸ state persisted in cluster (e.g. updateStateByKey())

SPARK STREAMING - WHY BOTHER?

▸ Popular

▸ Not only streaming

▸ ML

▸ SQL

▸ GraphX

▸ but …

SPARK STRUCTURED STREAMING

▸ Alpha in Spark 2.0

▸ Micro-batches not exposed

▸ groupBy(window(…))

▸ Event-time support

▸ No watermarks or session windows yet (2.1?)

▸ Trigger: processing time; outputs changed windows

▸ Exactly-once processing*

FLINK

▸ Mostly with keyed streams (parallelism)

▸ TimeCharacteristic: event/ingestion/processing

▸ TimestampAssigner: also generates watermarks

▸ WindowAssigner: arbitrary, built-in tumbling, sliding, session

▸ Trigger: event/processing time, count, single/continuous

▸ Window function: fold/reduce/with-kv-state

▸ Exactly-once* / at-least-once

KAFKA STREAMS

▸ State: Kafka topics, or a local key-value store backed by a topic for resiliency

▸ Watermarks: no, but windows are retained for 1 day

▸ Time: event/ingestion/processing; TimestampExtractor

▸ Tumbling/sliding windows

▸ Trigger: after every element

▸ aggregate by key&window into an ever-updating KTable

▸ At-least-once
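
The "aggregate by key & window into an ever-updating table" model can be illustrated without any Kafka dependency. The sketch below is a plain-Python model of the idea, not the Kafka Streams API: the function name is hypothetical, windows are tumbling, and the aggregation is a count.

```python
def windowed_counts(records, window_size):
    """Count records per (key, window start), emitting an update after every element."""
    table = {}      # (key, window_start) -> current count
    changelog = []  # every update to the table, in order
    for key, ts in records:
        window_start = ts - ts % window_size
        table_key = (key, window_start)
        table[table_key] = table.get(table_key, 0) + 1
        changelog.append((table_key, table[table_key]))
    return table, changelog


records = [("alice", 1), ("bob", 2), ("alice", 3), ("alice", 11)]
table, changelog = windowed_counts(records, window_size=10)
```

There is no notion of a window being "finished" here: each incoming element updates its entry, and the changelog carries every intermediate value (`("alice", 0)` is emitted first with count 1, then with count 2).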

AKKA STREAMS

▸ Single-node, no clustering

▸ No out-of-the-box windowing support, but quite easy to implement:

▸ Windows: arbitrary, assign windows to each element

▸ Trigger: only window-close

▸ State: local

▸ Watermarks: can be implemented
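
The recipe above (assign a window to each element, keep local state, emit on window close, track a watermark by hand) can be sketched as a single-node generator. This is a plain-Python model of the approach, not Akka Streams code; the function name, the tumbling windows, and the fixed `max_delay` heuristic are assumptions.

```python
def windowed_sums(events, size, max_delay):
    """Sum values per tumbling window; yield ((start, end), sum) when a window closes."""
    open_windows = {}  # window start -> running sum (local state)
    max_seen = float("-inf")
    for event_time, value in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - max_delay
        start = event_time - event_time % size  # assign the element to its window
        if start + size > watermark:
            open_windows[start] = open_windows.get(start, 0) + value
        # else: the window has already been closed, so the late element is dropped
        for closed in sorted(s for s in open_windows if s + size <= watermark):
            yield (closed, closed + size), open_windows.pop(closed)
    for start in sorted(open_windows):  # end of stream: flush whatever is still open
        yield (start, start + size), open_windows[start]


events = [(1, 1), (4, 2), (12, 10), (8, 3), (16, 20), (9, 100), (25, 5)]
results = list(windowed_sums(events, size=10, max_delay=5))
```

The out-of-order element at time 8 still lands in window `(0, 10)`, because the watermark is only at 7 when it arrives. The element at time 9 arrives after that window has been closed and is dropped, which is the price of bounding the state.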

SUMMING UP

▸ Spark: widely used, some features missing

▸ Flink: versatile

▸ Kafka: simple model

▸ Akka: single-node

SUMMING UP

▸ Windowing is just one of the aspects

▸ Other:

▸ State management

▸ Work distribution

▸ Processing guarantees

SUMMING UP

▸ Other stream processing systems out there!

▸ Apache Storm

▸ Google Cloud Dataflow

▸ Amazon Kinesis

▸ Apache Beam

▸ …

LINKS

▸ Streaming 101 & 102: 

▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

▸ https://softwaremill.com/windowing-data-in-akka-streams/

THANKS!

ADAM WARSKI

@ADAMWARSKI