Debunking Six Common Myths in Stream Processing


Kostas Tzoumas (@kostas_tzoumas)

Flink London Meetup, November 3, 2016


Original creators of Apache Flink

Providers of the dA Platform, a supported Flink distribution

Outline

What is data streaming

Myth 1: The Lambda architecture

Myth 2: The throughput/latency tradeoff

Myth 3: Exactly once not possible

Myth 4: Streaming is for (near) real-time

Myth 5: Batching and buffering

Myth 6: Streaming is hard

The streaming architecture


Reconsideration of data architecture

Better app isolation

More real-time reaction to events

Robust continuous applications

Process both real-time and historical data


[Diagram: applications, each with its own app state, connected through an event log]


What is (distributed) streaming

Computations on never-ending streams of data records (events)

A stream processor distributes the computation in a cluster

[Diagram: instances of your code running on several cluster nodes]
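How a stream processor spreads your code across a cluster can be pictured as key-based routing: each record's key is hashed to pick the parallel instance responsible for it. The sketch below is illustrative only (names like `partition` and `NUM_WORKERS` are made up, not Flink's internals):

```python
# Illustrative sketch: a stream processor distributes work by hashing
# each record's key to one of several parallel instances of your code.

NUM_WORKERS = 4

def partition(key: str) -> int:
    """Pick the worker instance responsible for this key."""
    return hash(key) % NUM_WORKERS

# Every record with the same key is routed to the same worker,
# so per-key computation can run in parallel across the cluster.
events = [("user-a", 1), ("user-b", 2), ("user-a", 3)]
routed = {}
for key, value in events:
    routed.setdefault(partition(key), []).append((key, value))
```

Because routing is deterministic per key, all events for "user-a" land on the same instance, which is what makes per-key state (next slides) possible.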

What is stateful streaming

Computation and state: e.g., counters, windows of past events, state machines, trained ML models

Result depends on the history of the stream

A stateful stream processor gives you the tools to manage state: recover, roll back, version, upgrade, etc.
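"Result depends on history" can be shown with a minimal per-key counter. This is a hand-rolled sketch of the idea, not a stream processor's API; in a real system the state would be managed (and made fault-tolerant) for you:

```python
# Minimal sketch of stateful streaming: the result of processing each
# event depends on state accumulated from the history of the stream.
from collections import defaultdict

state = defaultdict(int)  # e.g., a per-user event counter

def process(event):
    user, _payload = event
    state[user] += 1            # update state
    return (user, state[user])  # result depends on stream history

results = [process(e) for e in [("alice", "x"), ("bob", "y"), ("alice", "z")]]
# The second "alice" event sees count 2: history matters.
```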


What is event-time streaming

Data records associated with timestamps (time series data)

Processing depends on timestamps

An event-time stream processor gives you the tools to reason about time: e.g., handle streams that are out of order. The core feature is watermarks, a clock to measure event time.
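One common way to generate watermarks (an assumption here, one strategy among several) is to track the maximum timestamp seen and subtract a fixed out-of-orderness bound. A sketch:

```python
# Sketch of an event-time clock via watermarks. Assumption: events carry
# timestamps, and the watermark is the max timestamp seen so far minus a
# fixed bound on how out-of-order the stream can be.

MAX_OUT_OF_ORDER = 2

def watermarks(timestamped_events):
    """Yield (event, watermark) pairs; the watermark only moves forward."""
    max_ts = float("-inf")
    for ts, value in timestamped_events:
        max_ts = max(max_ts, ts)
        yield (ts, value), max_ts - MAX_OUT_OF_ORDER

# Out-of-order input: the event with timestamp 3 arrives after timestamp 5.
stream = [(1, "a"), (5, "b"), (3, "c"), (8, "d")]
wm_trace = [wm for _event, wm in watermarks(stream)]
# wm_trace == [-1, 3, 3, 6]
```

When the watermark passes a window's end timestamp, the window can be finalized even though events arrived out of order; that is how the processor "reasons about time".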




What is streaming

Continuous processing on data that is continuously generated

I.e., pretty much all big data

It's all about state and time


Myth 1: The Lambda architecture


Myth variations

Stream processing is approximate

Stream processing is for transient data

Stream processing cannot handle high data volume

Hence, stream processing needs to be coupled with batch processing

Lambda architecture

[Diagram: a batch layer (files processed by scheduled jobs) running alongside a streaming job, both feeding a serve & store layer]

Lambda no longer needed

Lambda was useful in the early days of stream processing (the beginning of Apache Storm)

Not any more: stream processors can handle very large volumes, and stream processors can compute accurate results

The good news is I don't hear "Lambda" so often anymore

Myth 2: Throughput/latency tradeoff


Myth flavors

Low-latency systems cannot support high throughput

In general, you need to trade off one for the other

There is a high-throughput category and a low-latency category (naming varies)

Physical limits

Most stream processing pipelines are network-bottlenecked

The network dictates both (1) what the latency is and (2) what the throughput is

A well-engineered system achieves the physical limits allowed by the network

Buffering

It is natural to handle many records together; all software and hardware systems do that. E.g., the network bundles bytes into frames.

Every streaming system buffers records for performance (Flink certainly does). You don't want to send single records over the network; "record-at-a-time" does not exist at the physical level.

Buffering (2)

Buffering is a performance optimization. It should be opaque to the user, should not dictate system behavior in any other way, should not impose artificial boundaries, and should not limit what you can do with the system.
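The "opaque to the user" point can be made concrete: the user-facing API stays record-at-a-time while records are shipped in groups underneath. This is a toy sketch with hypothetical names (`BufferedChannel`, `emit`), not any system's real internals; real systems also flush on a timeout so latency stays bounded:

```python
# Sketch of buffering as a transparent performance optimization: records
# are collected and sent in groups, but the caller still emits one record
# at a time and never sees the batching.

class BufferedChannel:
    def __init__(self, send_batch, buffer_size=4):
        self.send_batch = send_batch    # e.g., write one network frame
        self.buffer_size = buffer_size
        self.buffer = []

    def emit(self, record):
        """User-facing API: one record at a time."""
        self.buffer.append(record)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        """Ship whatever is buffered (real systems also do this on a timer)."""
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []

sent = []
ch = BufferedChannel(sent.append, buffer_size=3)
for r in range(7):
    ch.emit(r)
ch.flush()
# sent == [[0, 1, 2], [3, 4, 5], [6]]
```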


Some numbers

[Benchmark chart]

Some more

[Chart: classic batch jobs on a stream processor: TeraSort, relational join, graph processing, linear algebra]

Myth 3: Exactly once not possible


What is exactly once

Under failures, the system computes the result as if there was no failure

In contrast to:
At most once: no guarantees
At least once: duplicates possible

Exactly-once state versus exactly-once delivery

Myth variations

Exactly once is not possible in nature

Exactly once is not possible end-to-end

Exactly once is not needed

You need to trade off performance for exactly once

(Usually perpetuated by folks until they implement exactly once)


Transactions

Exactly once is transactions: either all actions succeed or none succeed

Transactions are possible

Transactions are useful

Let's not start the eventual-consistency debate all over again

Flink checkpoints

Periodic, asynchronous, consistent snapshots of application state

Provide exactly-once state guarantees under failures
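The recovery idea behind checkpoints can be sketched in miniature (greatly simplified; Flink's actual snapshots are asynchronous and distributed): periodically snapshot the state together with the input position, and on failure restore the snapshot and replay from that position.

```python
# Toy sketch of checkpoint-based recovery: snapshot (position, state)
# every few events; on failure, restore the snapshot and replay the
# events since then. The final result is as if no failure occurred.
import copy

events = list(range(10))
CHECKPOINT_EVERY = 4

state = {"sum": 0}
checkpoint = (0, {"sum": 0})             # (input position, state snapshot)

for pos, e in enumerate(events):
    state["sum"] += e
    if (pos + 1) % CHECKPOINT_EVERY == 0:
        checkpoint = (pos + 1, copy.deepcopy(state))

    if pos == 6:                          # simulate a failure mid-stream...
        restore_pos, state = checkpoint   # ...restore the last snapshot...
        for replayed in events[restore_pos:pos + 1]:
            state["sum"] += replayed      # ...and replay the lost events

# state["sum"] == sum(range(10)) == 45, despite the failure
```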

End-to-end exactly once

Checkpoints double as a transaction coordination mechanism

Source and sink operators can take part in checkpoints

Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates

[Diagram: transactional sinks]
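"Effectively once" with idempotent updates works because a replayed write overwrites with the same value instead of applying twice. A sketch, with a plain dict standing in for an external key-value store like Cassandra (the `upsert` helper is hypothetical):

```python
# Sketch of an idempotent sink: after recovery the same updates are
# replayed, but an upsert keyed by record changes nothing the second time.

sink = {}  # key -> latest value, standing in for a key-value store row

def upsert(key, value):
    sink[key] = value  # idempotent: applying the same update twice is harmless

updates = [("page-1", 10), ("page-2", 7)]
for k, v in updates:
    upsert(k, v)
for k, v in updates:   # after a failure, the same updates are replayed
    upsert(k, v)

# sink == {"page-1": 10, "page-2": 7}: the replay changed nothing
```

Contrast this with a non-idempotent sink (e.g., incrementing a counter on each write), where the replay would double-count; that is why the end-to-end guarantee depends on the sink.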

State management

Checkpoints triple as a state versioning mechanism (savepoints)

Go back and forth in time while maintaining state consistency

Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests

Myth 4: Streaming = real time


Myth variations

I don't have low-latency applications, hence I don't need stream processing

Stream processing is only relevant for data before it is stored

We need a batch processor to do heavy offline computations


Low-latency and high-latency streams

[Diagram: a timeline of hourly timestamps partitioned into a low-latency stream, a bounded stream (batch), and a high-latency stream]

Robust continuous applications

Accurate computation

Batch processing is not an accurate computation model for continuous data: it misses the right concepts and primitives (time handling, state across batch boundaries)

Stateful stream processing is a better model; real-time/low-latency is the icing on the cake

Myth 5: Batching and buffering


Myth variations

There is a "mini-batch" category between batch and streaming

Record-at-a-time versus mini-batching or similar "choices"

Mini-batch systems can get better throughput


Myth variations (2)

The difference between mini-batching and streaming is latency

I don't need low latency, hence I need mini-batching

I have a mini-batching use case

We have answered this already

You can get both throughput and latency (myth #2); every system buffers data, from the network to the OS to Flink

Streaming is a model, not just "fast" (myth #4): it is about time and state, and low latency is the icing on the cake

Continuous operation

Data is continuously produced

Computation should track data production, with dynamic scaling and pause-and-resume

Restarting our pipelines every second is not a great idea, and not just for latency reasons

Myth 6: Streaming is hard


Myth variations

Streaming is hard to learn

Streaming is hard to reason about

Windows? Event time? Triggers? Oh, my!!

Streaming needs to be coupled with batch

I know batch already


It's about your data and code

What's the form of your data?
Unbounded (e.g., clicks, sensors, logs), or
Bounded (e.g., ???*)

What changes more often?
My code changes faster than my data
My data changes faster than my code

* Please help me find a great example of naturally static data

It's about your data and code

If your data changes faster than your code, you have a streaming problem. You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches. You are probably living with inaccurate results without knowing it.

It's about your data and code

If your code changes faster than your data, you have an exploration problem. Using notebooks or other tools for quick data exploration is a good idea. Once your code stabilizes, you will have a streaming problem, so you might as well think of it as such from the beginning.

Flink in the real world


Flink community

> 240 contributors; 95 contributors in Flink 1.1

42 meetups around the world with > 15,000 members

2x-3x growth in 2015, similar in 2016


Powered by Flink

Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.

King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.

Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.

Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time. See more at

30 Flink applications in production for more than one year; 10 billion events (2 TB) processed daily

Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees

Largest job has > 20 operators, runs on > 5000 vCores in a 1000-node cluster, and processes millions of events per second


Flink Forward 2016

Current work in Flink


Flink's unique combination of features

[Diagram: low latency, high throughput, well-behaved flow control (back pressure), consistency, works on real ...]