Continuous Processing with Apache Flink - Strata London 2016


Stephan Ewen (@stephanewen)

Continuous Processing with Apache Flink

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.

Continuous Apps before Streaming

[Diagram: over time, a scheduler periodically runs Job 1, Job 2, Job 3 over file 1, file 2, file 3, feeding a serving layer]

Continuous Apps with Lambda

[Diagram: the same scheduled batch jobs and serving layer, complemented by a parallel streaming job (the Lambda architecture)]

Continuous Apps with Streaming

[Diagram: collect → log → analyze → serve & store]

Continuous Data Sources

[Diagram: a partitioned log. Annotations: process a period of historic data; process the latest data with low latency (tail of the log); reprocess the stream (historic data first, catching up with real-time data)]

Continuous Data Sources

[Diagram: a stream of events in Apache Kafka partitions, and the same data as a stream view over a sequence of hourly files from 2016-3-1 onward]

Continuous Processing

Time and State

Enter Apache Flink

Apache Flink Stack

• Libraries

• DataStream API (Stream Processing)

• DataSet API (Batch Processing)

• Runtime (Distributed Streaming Dataflow)

Streaming and batch as first class citizens.

Programs and Dataflows

[Dataflow: Source → Transformation → Transformation → Sink]

val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

stats.addSink(new RollingSink(path))

[Parallel streaming dataflow: Source [1|2] → map() [1|2] → keyBy()/window()/apply() [1|2] → Sink [1]]
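The snippet above leaves out its surrounding pieces. As a rough sketch (not from the slides), the assumed Event and Statistic types, the parse helper, and the aggregation function might look as follows; all names, fields, and the CSV format are illustrative assumptions:

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Illustrative types; the real schema is not shown in the slides
case class Event(sensor: String, measure: Double)
case class Statistic(sensor: String, sum: Double)

// Hypothetical line parser, e.g. for "sensor,measure" CSV lines
def parse(line: String): Event = {
  val Array(sensor, measure) = line.split(",")
  Event(sensor, measure.toDouble)
}

// Hypothetical window function: sums the measures of one sensor per window
class MyAggregationFunction extends WindowFunction[Event, Statistic, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow,
                     input: Iterable[Event], out: Collector[Statistic]): Unit = {
    out.collect(Statistic(input.head.sensor, input.map(_.measure).sum))
  }
}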

What makes Flink flink?

• True streaming: low latency, high throughput, well-behaved flow control (back pressure)

• Event time: works on real-time and historic data; make more sense of data

• Stateful streaming: windows & user-defined state, exactly-once semantics for fault tolerance, globally consistent savepoints

• APIs & libraries: flexible windows (time, count, session, roll-your-own), Complex Event Processing


(It's) About Time

Different Notions of Time

[Diagram: Event Producer → Message Queue → Flink Data Source → Flink Window Operator, annotated with Event Time (at the producer), Storage Time (in the queue), Ingestion Time (at the Flink data source), and Window Processing Time (at the window operator)]

Event Time vs. Processing Time

[Example: the Star Wars episodes. Ordered by processing time (release year), they run Episode IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015); ordered by event time (story order), they run Episodes I through VII]

Batch: Implicit Treatment of Time

Time is treated outside of your application. Data is grouped by storage ingestion time.

[Diagram: a sequence of 1h batch jobs feeding a serving layer]

Streaming: Windows

Aggregates on streams are scoped by windows:

• Time-driven, e.g. the last X minutes

• Data-driven, e.g. the last X records

Example: "Average over the last 5 minutes"

Event Time Windows


Event Time Windows reorder the events to their Event Time order

Processing Time


case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(ProcessingTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Ingestion Time


case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(IngestionTime)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")

Event Time


case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)

val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
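The slides do not show MyTimestampsAndWatermarkGenerator. One plausible implementation is a periodic assigner that reads the timestamp carried in each event, tracks the highest timestamp seen, and allows a fixed amount of out-of-orderness; the 3.5-second bound below is an arbitrary assumption:

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Sketch only: bounded-out-of-orderness watermarks over the Event type above
class MyTimestampsAndWatermarkGenerator extends AssignerWithPeriodicWatermarks[Event] {

  private val maxOutOfOrderness = 3500L                              // milliseconds, assumed
  private var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness

  // Use the timestamp carried in the event itself
  override def extractTimestamp(element: Event, previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = math.max(element.timestamp, currentMaxTimestamp)
    element.timestamp
  }

  // Called periodically by Flink; promises that no events older than this will follow
  override def getCurrentWatermark(): Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}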

The Power of Event Time

Batch processors (event time in ingestion-time batches):

• Stable across re-executions
• Wrong grouping at batch boundaries

Traditional stream processors (processing time):

• Results depend on when the program runs (different on re-execution)
• Results affected by network speed and delays

Event-time stream processors (event time):

• Stable across re-executions
• No incorrect results at batch boundaries

In other words: processing time is purely wall-clock time, event time is purely data-driven time, and event time within ingestion-time batches mixes data-driven and wall-clock time.

Event Time Progress: Watermarks

[Diagram: two streams of events with per-event timestamps, interleaved with watermarks W(11) and W(17)/W(20); one stream is in order, the other out of order. A watermark W(t) signals that no events with a timestamp ≤ t are still expected]

Bounding the Latency for Results

Triggering on combinations of event time and processing time. See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incubating): the concepts apply almost 1:1 to Apache Flink; the syntax varies.


Matters of State

Batch vs. Continuous

Batch Jobs:

• No state across batches
• Fault tolerance within a job
• Re-processing starts empty

Continuous Programs:

• Continuous state across time
• Fault tolerance guards state
• Reprocessing starts stateful

Continuous State

[Diagram: sessions evolving over time; there is no stateless point in time]
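To make "continuous state across time" concrete, here is an illustrative sketch (not from the slides) of keyed state in the DataStream API: a per-sensor running count that lives as long as the job and its savepoints do. The class name, state name, and output type are made up:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Illustrative only: counts events per key; the count lives in Flink's keyed state
class RunningCount extends RichFlatMapFunction[Event, (String, Long)] {

  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    // Register the state handle with a default value of 0
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long](
        "count", classOf[java.lang.Long], java.lang.Long.valueOf(0L)))
  }

  override def flatMap(event: Event, out: Collector[(String, Long)]): Unit = {
    val newCount = count.value() + 1   // read the current per-key state
    count.update(newCount)             // write it back
    out.collect((event.id, newCount))
  }
}

// usage: events.keyBy("id").flatMap(new RunningCount())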

Re-processing data (in batch)

[Diagram: hourly files from 2016-3-1, 12:00 am through 7:00 am, each processed by its own batch job]

Re-processing data (in batch)

[Diagram: the same hourly files; state that spans the file boundaries (e.g. sessions) leads to wrong / corrupt results]

Streaming: Savepoints

[Diagram: Savepoint A and Savepoint B taken at two points along the stream]

Globally consistent point-in-time snapshot of the streaming application

Re-processing data (continuous)

[Diagram: a new streaming program is started from Savepoint A]

Draw savepoints at times that you will want to start new jobs from (daily, hourly, …). Reprocess by starting a new job from a savepoint:

• Defines the start position in the stream (for example, Kafka offsets)
• Initializes pending state (like partial sessions)
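Operationally this is driven from the Flink command line client; a rough sketch, where the job id, savepoint path, and jar name are placeholders:

# Trigger a savepoint for a running job
bin/flink savepoint <jobID>

# Start a new (or modified) job from that savepoint
bin/flink run -s <savepointPath> my-streaming-app.jar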

Forking and Versioning Applications

[Diagram: savepoints taken from App. A are used to fork and start App. B and App. C]

Conclusion

Wrap up:

• Streaming is the architecture for continuous processing
• Continuous processing makes data applications:
  • Simpler: fewer moving parts
  • More correct: no broken state at any boundaries
  • More flexible: reprocess data and fork applications via savepoints
• Requires a powerful stream processor, like Apache Flink

Upcoming Features

• Dynamic scaling, resource elasticity
• Stream SQL
• CEP enhancements
• Incremental & asynchronous state snapshotting
• Mesos support
• More connectors, end-to-end exactly-once
• API enhancements (e.g., joins, slowly changing inputs)
• Security (data encryption, Kerberos with Kafka)

What makes Flink flink? (recap of the feature overview above)

Flink Forward 2016, Berlin. Submission deadline: June 30, 2016. Early bird deadline: July 15, 2016.

www.flink-forward.org

We are hiring! data-artisans.com/careers
