Continuous Processing with Apache Flink - Strata London 2016


Stephan Ewen (@stephanewen)

Continuous Processing with Apache Flink

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.

Continuous Apps before Streaming

[Diagram: over time, a scheduler periodically runs Job 1, Job 2, Job 3 over file 1, file 2, file 3, feeding a serving layer]

Continuous Apps with Lambda

[Diagram: the same scheduled batch jobs and serving layer, complemented by a parallel streaming job (the Lambda architecture)]

Continuous Apps with Streaming

[Diagram: collect → log → analyze → serve & store]

Continuous Data Sources

[Diagram: a partitioned log. Annotations: process a period of historic data; process the latest data with low latency (tail of the log); reprocess the stream (historic data first, catching up with real-time data)]

Continuous Data Sources

[Diagram: a stream of events in Apache Kafka partitions, and the same data as a stream view over a sequence of hourly files from 2016-3-1 onward]

Continuous Processing

Time and State

Enter Apache Flink

Apache Flink Stack

• Libraries

• DataStream API (Stream Processing)

• DataSet API (Batch Processing)

• Runtime (Distributed Streaming Dataflow)

Streaming and batch as first class citizens.

Programs and Dataflows

[Dataflow: Source → Transformation → Transformation → Sink]

val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))

[Parallel streaming dataflow: Source [1|2] → map() [1|2] → keyBy()/window()/apply() [1|2] → Sink [1]]
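The snippet above leaves out its surrounding pieces. As a rough sketch (not from the slides), the assumed Event and Statistic types, the parse helper, and the aggregation function might look as follows; all names, fields, and the CSV format are illustrative assumptions:

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Illustrative types; the real schema is not shown in the slides
case class Event(sensor: String, measure: Double)
case class Statistic(sensor: String, sum: Double)

// Hypothetical line parser, e.g. for "sensor,measure" CSV lines
def parse(line: String): Event = {
  val Array(sensor, measure) = line.split(",")
  Event(sensor, measure.toDouble)
}

// Hypothetical window function: sums the measures of one sensor per window
class MyAggregationFunction extends WindowFunction[Event, Statistic, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow,
                     input: Iterable[Event], out: Collector[Statistic]): Unit = {
    out.collect(Statistic(input.head.sensor, input.map(_.measure).sum))
  }
}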

What makes Flink flink?

• True streaming: low latency, high throughput, well-behaved flow control (back pressure)

• Event time: works on real-time and historic data; make more sense of data

• Stateful streaming: windows & user-defined state, exactly-once semantics for fault tolerance, globally consistent savepoints

• APIs & libraries: flexible windows (time, count, session, roll-your-own), Complex Event Processing


(It's) About Time

Different Notions of Time

[Diagram: Event Producer → Message Queue → Flink Data Source → Flink Window Operator, annotated with Event Time (at the producer), Storage Time (in the queue), Ingestion Time (at the Flink data source), and Window Processing Time (at the window operator)]

Event Time vs. Processing Time

[Example: the Star Wars episodes. Ordered by processing time (release year), they run Episode IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015); ordered by event time (story order), they run Episodes I through VII]

Batch: Implicit Treatment of Time

Time is treated outside of your application. Data is grouped by storage ingestion time.

[Diagram: a sequence of 1h batch jobs feeding a serving layer]

Streaming: Windows

Aggregates on streams are scoped by windows:

• Time-driven, e.g. the last X minutes

• Data-driven, e.g. the last X records

Example: "Average over the last 5 minutes"

Event Time Windows


Event Time Windows reorder the events to their Event Time order

Processing Time


case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(ProcessingTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Ingestion Time


case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(IngestionTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Event Time


case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)

val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
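The slides do not show MyTimestampsAndWatermarkGenerator. One plausible implementation is a periodic assigner that reads the timestamp carried in each event, tracks the highest timestamp seen, and allows a fixed amount of out-of-orderness; the 3.5-second bound below is an arbitrary assumption:

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Sketch only: bounded-out-of-orderness watermarks over the Event type above
class MyTimestampsAndWatermarkGenerator extends AssignerWithPeriodicWatermarks[Event] {

  private val maxOutOfOrderness = 3500L                              // milliseconds, assumed
  private var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness

  // Use the timestamp carried in the event itself
  override def extractTimestamp(element: Event, previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = math.max(element.timestamp, currentMaxTimestamp)
    element.timestamp
  }

  // Called periodically by Flink; promises that no events older than this will follow
  override def getCurrentWatermark(): Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}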

The Power of Event Time

Batch processors (event time in ingestion-time batches):

• Stable across re-executions
• Wrong grouping at batch boundaries

Traditional stream processors (processing time):

• Results depend on when the program runs (different on re-execution)
• Results affected by network speed and delays

Event-time stream processors (event time):

• Stable across re-executions
• No incorrect results at batch boundaries

In other words: processing time is purely wall-clock time, event time is purely data-driven time, and event time within ingestion-time batches mixes data-driven and wall-clock time.

Event Time Progress: Watermarks

[Diagram: two streams of events with per-event timestamps, interleaved with watermarks W(11) and W(17)/W(20); one stream is in order, the other out of order. A watermark W(t) signals that no events with a timestamp ≤ t are still expected]

Bounding the Latency for Results

Triggering on combinations of event time and processing time. See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incubating): the concepts apply almost 1:1 to Apache Flink; the syntax varies.


Matters of State

Batch vs. Continuous

Batch Jobs:

• No state across batches
• Fault tolerance within a job
• Re-processing starts empty

Continuous Programs:

• Continuous state across time
• Fault tolerance guards state
• Reprocessing starts stateful

Continuous State

[Diagram: sessions evolving over time; there is no stateless point in time]
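To make "continuous state across time" concrete, here is an illustrative sketch (not from the slides) of keyed state in the DataStream API: a per-sensor running count that lives as long as the job and its savepoints do. The class name, state name, and output type are made up:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Illustrative only: counts events per key; the count lives in Flink's keyed state
class RunningCount extends RichFlatMapFunction[Event, (String, Long)] {

  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    // Register the state handle with a default value of 0
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long](
        "count", classOf[java.lang.Long], java.lang.Long.valueOf(0L)))
  }

  override def flatMap(event: Event, out: Collector[(String, Long)]): Unit = {
    val newCount = count.value() + 1   // read the current per-key state
    count.update(newCount)             // write it back
    out.collect((event.id, newCount))
  }
}

// usage: events.keyBy("id").flatMap(new RunningCount())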

Re-processing data (in batch)

[Diagram: hourly files from 2016-3-1, 12:00 am through 7:00 am, each processed by its own batch job]

Re-processing data (in batch)

[Diagram: the same hourly files; state that spans the file boundaries (e.g. sessions) leads to wrong / corrupt results]

Streaming: Savepoints

[Diagram: Savepoint A and Savepoint B taken at two points along the stream]

Globally consistent point-in-time snapshot of the streaming application

Re-processing data (continuous)

[Diagram: a new streaming program is started from Savepoint A]

Draw savepoints at times that you will want to start new jobs from (daily, hourly, …). Reprocess by starting a new job from a savepoint:

• Defines the start position in the stream (for example, Kafka offsets)
• Initializes pending state (like partial sessions)
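Operationally this is driven from the Flink command line client; a rough sketch, where the job id, savepoint path, and jar name are placeholders:

# Trigger a savepoint for a running job
bin/flink savepoint <jobID>

# Start a new (or modified) job from that savepoint
bin/flink run -s <savepointPath> my-streaming-app.jar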

Forking and Versioning Applications

[Diagram: savepoints taken from App. A are used to fork and start App. B and App. C]

Conclusion

Wrap up:

• Streaming is the architecture for continuous processing
• Continuous processing makes data applications:
  • Simpler: fewer moving parts
  • More correct: no broken state at any boundaries
  • More flexible: reprocess data and fork applications via savepoints
• Requires a powerful stream processor, like Apache Flink

Upcoming Features

• Dynamic scaling, resource elasticity
• Stream SQL
• CEP enhancements
• Incremental & asynchronous state snapshotting
• Mesos support
• More connectors, end-to-end exactly-once
• API enhancements (e.g., joins, slowly changing inputs)
• Security (data encryption, Kerberos with Kafka)

What makes Flink flink? (recap of the feature overview above)

Flink Forward 2016, Berlin. Submission deadline: June 30, 2016. Early bird deadline: July 15, 2016.

www.flink-forward.org

We are hiring! data-artisans.com/careers
