Extending Flink’s Streaming APIs

Kostas Kloudas (@KLOUBEN_K)
Flink Forward San Francisco, April 11, 2017



Original creators of Apache Flink®

Providers of the dA Platform, a supported Flink distribution


Extensions to the DataStream API


▪ ProcessFunction for Low-level Operations

▪ Support for Asynchronous I/O

ProcessFunction

Stream Processing

Computations on never-ending “streams” of events

Distributed Stream Processing

Computation spread across many machines

Stateful Stream Processing

Result depends on history of stream

Stream Processing Engines

▪ Time:
  • handle infinite streams
  • with out-of-order events

▪ State:
  • guarantee fault-tolerance (distributed)
  • guarantee consistency (infinite streams)

ProcessFunction

▪ Gives access to all basic building blocks:
  • Events
  • Fault-tolerant, Consistent State
  • Timers (event- and processing-time)
  • Side Outputs

Common Usecase Skeleton A

▪ On each incoming element:
  • update some state
  • register a callback for a moment in the future

▪ When that moment comes:
  • Check a condition and perform a certain action, e.g. emit an element

Before the ProcessFunction

▪ Use built-in windowing:
  • + Expressive
  • + A lot of functionality out-of-the-box
  • - Not always intuitive
  • - An overkill for simple cases

▪ Write your own operator:
  • - Too many things to account for

ProcessFunction

▪ Simple yet powerful API:

    /**
     * Process one element from the input stream.
     */
    void processElement(I value, Context ctx, Collector<O> out) throws Exception;

    /**
     * Called when a timer set using {@link TimerService} fires.
     */
    void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception;

▪ Collector<O> out: a collector to emit result values

▪ Context ctx:
  1. Get the timestamp of the element
  2. Register and use side outputs
  3. Interact with the TimerService to:
     • query the current time
     • register timers

▪ OnTimerContext ctx:
  1. Do the above
  2. Query if we are on Event or Processing time

ProcessFunction: example

▪ Requirements:
  • maintain counts per incoming key, and
  • emit the key/count pair if no element came for the key in the last 100 ms (in event time)

ProcessFunction: example

▪ Implementation sketch:
  • Store the count, key and last mod timestamp in a ValueState (scoped by key)
  • For each record:
    • update the counter and the last mod timestamp
    • register a timer 100ms from “now” (in event time)
  • When the timer fires:
    • check the timer’s timestamp against the last mod time for that key, and
    • emit the key/count pair if they differ by 100ms
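The sketch above can be tried out without a Flink cluster. The following is a minimal plain-Java simulation, not Flink code: the class IdleCountSimulation, its methods, and the advanceWatermark helper are hypothetical names introduced only for illustration. It assumes that timers fire once the watermark reaches their timestamp and that a timer is registered at most once per key and timestamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Plain-Java stand-in for the implementation sketch; all names are hypothetical. */
final class IdleCountSimulation {
    static final long TIMEOUT_MS = 100;

    /** Per-key state: the count and the event time of the last update. */
    static final class CounterWithTS {
        long count;
        long lastModified;
    }

    private final Map<String, CounterWithTS> state = new HashMap<>();
    // Pending timers: firing time -> keys that registered a timer for that time.
    private final TreeMap<Long, Set<String>> timers = new TreeMap<>();
    final List<String> emitted = new ArrayList<>();

    /** Mirrors processElement: update the state, register a timer 100 ms ahead. */
    void processElement(String key, long eventTime) {
        CounterWithTS current = state.computeIfAbsent(key, k -> new CounterWithTS());
        current.count++;
        current.lastModified = eventTime;
        timers.computeIfAbsent(eventTime + TIMEOUT_MS, t -> new LinkedHashSet<>()).add(key);
    }

    /** Mirrors a watermark: fire every timer whose time is <= the watermark. */
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, Set<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                onTimer(key, due.getKey());
            }
        }
    }

    /** Mirrors onTimer: emit only if no newer element has moved lastModified. */
    private void onTimer(String key, long timestamp) {
        CounterWithTS result = state.get(key);
        if (timestamp == result.lastModified + TIMEOUT_MS) {
            emitted.add(key + "=" + result.count);
        }
    }

    public static void main(String[] args) {
        IdleCountSimulation sim = new IdleCountSimulation();
        sim.processElement("a", 0);
        sim.processElement("a", 50); // makes the timer registered for t=100 stale
        sim.processElement("b", 60);
        sim.advanceWatermark(200);
        System.out.println(sim.emitted);
    }
}
```

The equality check in onTimer is what discards stale timers: a timer registered by an older element no longer matches lastModified + 100 once a newer element for the same key has arrived.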

ProcessFunction: example

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    public class MyProcessFunction extends
            ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {

        // define your state descriptors

        @Override
        public void processElement(Tuple2<String, String> value, Context ctx,
                Collector<Tuple2<String, Long>> out) throws Exception {
            // update our state and register a timer
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx,
                Collector<Tuple2<String, Long>> out) throws Exception {
            // check the state for the key and emit a result if needed
        }
    }

ProcessFunction: example

    public class MyProcessFunction extends
            ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {

        // define your state descriptors
        private final ValueStateDescriptor<CounterWithTS> stateDesc =
                new ValueStateDescriptor<>("myState", CounterWithTS.class);
    }

ProcessFunction: example

    public class MyProcessFunction extends
            ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {

        @Override
        public void processElement(Tuple2<String, String> value, Context ctx,
                Collector<Tuple2<String, Long>> out) throws Exception {

            // keyed state: one CounterWithTS per key
            ValueState<CounterWithTS> state = getRuntimeContext().getState(stateDesc);

            CounterWithTS current = state.value();
            if (current == null) {
                current = new CounterWithTS();
                current.key = value.f0;
            }
            current.count++;
            current.lastModified = ctx.timestamp();
            state.update(current);
            ctx.timerService().registerEventTimeTimer(current.lastModified + 100);
        }
    }

ProcessFunction: example

    public class MyProcessFunction extends
            ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>> {

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx,
                Collector<Tuple2<String, Long>> out) throws Exception {

            CounterWithTS result = getRuntimeContext().getState(stateDesc).value();

            if (timestamp == result.lastModified + 100) {
                out.collect(new Tuple2<String, Long>(result.key, result.count));
            }
        }
    }

ProcessFunction: example

    stream.keyBy("key").process(new MyProcessFunction())

ProcessFunction: Side Outputs

▪ Additional (to the main) output streams

▪ No type limitations
  • each side output can have its own type
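One way to see how each side output can carry its own type is the typed-tag pattern: the tag's type parameter ties every emitted value to the right element type. The following is a minimal plain-Java sketch of that idea. Tag and SideOutputs are hypothetical classes written for illustration and are not Flink's OutputTag or its runtime.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a typed output tag; not Flink's OutputTag. */
final class Tag<T> {
    final String id;
    final Class<T> type;

    Tag(String id, Class<T> type) {
        this.id = id;
        this.type = type;
    }
}

/** Collects records per tag; the tag's type parameter keeps each output type-safe. */
final class SideOutputs {
    private final Map<Tag<?>, List<Object>> outputs = new HashMap<>();

    /** Emits a value to the side output identified by the tag. */
    <T> void output(Tag<T> tag, T value) {
        outputs.computeIfAbsent(tag, t -> new ArrayList<>()).add(value);
    }

    /** Returns everything emitted for the tag, cast back to the tag's type. */
    <T> List<T> get(Tag<T> tag) {
        List<T> typed = new ArrayList<>();
        for (Object o : outputs.getOrDefault(tag, new ArrayList<>())) {
            typed.add(tag.type.cast(o));
        }
        return typed;
    }

    public static void main(String[] args) {
        Tag<String> keys = new Tag<>("gt10", String.class);
        Tag<Long> counts = new Tag<>("counts", Long.class);
        SideOutputs out = new SideOutputs();
        out.output(keys, "user-1");
        out.output(counts, 11L);
        System.out.println(out.get(keys) + " " + out.get(counts));
    }
}
```

Because output takes a Tag<T> together with a T, sending a Long to the String-typed tag is rejected at compile time.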

ProcessFunction: example+

▪ Requirements:
  • maintain counts per incoming key, and
  • emit the key/count pair if no element came for the key in the last 100 ms (in event time)
  • otherwise, if the count > 10, send the key to a side-output named gt10

ProcessFunction: example+

    final OutputTag<String> outputTag = new OutputTag<String>("gt10"){};

    SingleOutputStreamOperator<Tuple2<String, Long>> mainStream = input.process(
            new ProcessFunction<Tuple2<String, String>, Tuple2<String, Long>>() {

        // processElement(...) is not shown on this slide

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx,
                Collector<Tuple2<String, Long>> out) throws Exception {

            CounterWithTS result = getRuntimeContext().getState(stateDesc).value();

            if (timestamp == result.lastModified + 100) {
                // main output: the key was idle for 100 ms
                out.collect(new Tuple2<String, Long>(result.key, result.count));
            } else if (result.count > 10) {
                // side output: only the key, typed as String
                ctx.output(outputTag, result.key);
            }
        }
    });

    DataStream<String> sideOutputStream = mainStream.getSideOutput(outputTag);

ProcessFunction

▪ Applicable to Keyed streams

▪ For Non-Keyed streams:
  • group on a dummy key if you need the timers (BEWARE: parallelism of 1)
  • use it directly without the timers

▪ CoProcessFunction for low-level joins:
  • Applied on two input streams

Asynchronous I/O

Common Usecase Skeleton B

▪ On each incoming element:
  • extract some info from the element (e.g. key)
  • query an external storage system (DB or KV-store) for additional info
  • emit an enriched version of the input element

Before the AsyncIO support

▪ Write a MapFunction that queries the DB:
  • + Simple
  • - Slow (synchronous access) and/or
  • - Requires high parallelism (more tasks)

▪ Write your own operator:
  • - Too many things to account for

Synchronous Access

Communication delay can dominate application throughput and latency
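A back-of-the-envelope calculation shows why the communication delay matters. A task that waits for each response before sending the next request has one request in flight, so its throughput is bounded by 1 / latency; with several requests in flight the bound scales accordingly (Little's law). The sketch below computes that bound. The class name and the 10 ms round-trip figure are assumptions chosen for illustration, not measurements.

```java
/** Upper-bound estimate only; the class name and example numbers are illustrative. */
final class ThroughputEstimate {

    /** Little's law bound for one task: in-flight requests divided by latency. */
    static double maxRequestsPerSecond(int inFlight, double latencyMs) {
        return inFlight * 1000.0 / latencyMs;
    }

    public static void main(String[] args) {
        double latencyMs = 10.0; // assumed round-trip time to the external store
        System.out.println("synchronous (1 in flight): "
                + maxRequestsPerSecond(1, latencyMs) + " req/s");
        System.out.println("asynchronous (100 in flight): "
                + maxRequestsPerSecond(100, latencyMs) + " req/s");
    }
}
```

Under these assumed numbers a synchronous task tops out at 100 requests per second however fast its own code is, while 100 concurrent requests raise the bound to 10,000 per second without adding tasks.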


Asynchronous Access

AsyncFunction

▪ Requirement:
  • a client that supports asynchronous requests

▪ Flink handles the rest:
  • integration of async IO with DataStream API
  • fault-tolerance
  • order of emitted elements
  • correct time semantics (event/processing time)

AsyncFunction

▪ Simple API:

    /**
     * Trigger async operation for each stream input.
     */
    void asyncInvoke(IN input, AsyncCollector<OUT> collector) throws Exception;

▪ API call:

    /**
     * Example async function call.
     */
    DataStream<...> result = AsyncDataStream.(un)orderedWait(stream,
            new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);

AsyncFunction

AsyncWaitOperator:
  • a queue of “Promises”
  • a separate thread (Emitter)

When element E5 arrives, the AsyncWaitOperator:
  • wraps E5 in a “promise” P5
  • puts P5 in the queue
  • calls asyncInvoke(E5, P5)

[Figure: the queue holds promises P1 to P4; E5 enters the AsyncWaitOperator and the Emitter drains the queue]

AsyncFunction

asyncInvoke(value, asyncCollector):
  • a user-defined function
  • value: the input element
  • asyncCollector: the collector of the result (when the query returns)

    Future<String> future = client.query(E5);

    future.thenAccept((String result) -> {
        P5.collect(Collections.singleton(new Tuple2<>(E5, result)));
    });
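The callback pattern inside asyncInvoke can be tried in plain Java with java.util.concurrent.CompletableFuture. This is a minimal sketch, not Flink code: FakeClient and EnrichSketch are hypothetical, the "collector" is just a list, and the client's futures are completed by hand so the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical async client: hands out futures that the caller completes later. */
final class FakeClient {
    final List<CompletableFuture<String>> pending = new ArrayList<>();

    CompletableFuture<String> query(String key) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.add(future);
        return future;
    }
}

final class EnrichSketch {

    /** Shape of asyncInvoke: start the query, return at once, collect in the callback. */
    static void asyncInvoke(String input, FakeClient client, List<String> collector) {
        client.query(input).thenAccept(result -> collector.add(input + "->" + result));
    }

    public static void main(String[] args) {
        FakeClient client = new FakeClient();
        List<String> collected = new ArrayList<>();

        asyncInvoke("E5", client, collected);
        System.out.println("before response: " + collected);

        client.pending.get(0).complete("info"); // the "database" answers
        System.out.println("after response:  " + collected);
    }
}
```

The point of the pattern is that asyncInvoke itself never blocks: it only registers the callback, so the calling thread is free to issue the next request while earlier ones are still outstanding.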


AsyncFunction

Emitter:
  • separate thread
  • polls queue for completed promises (blocking)
  • emits elements downstream

AsyncFunction

    DataStream<Tuple2<String, String>> result = AsyncDataStream.(un)orderedWait(stream,
            new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);

▪ new MyAsyncFunction(): our asyncFunction
▪ 1000, TimeUnit.MILLISECONDS: a timeout, the max time until a request is considered failed
▪ 100: the capacity, the max number of in-flight requests

AsyncFunction

    DataStream<Tuple2<String, String>> result = AsyncDataStream.unorderedWait(stream,
            new MyAsyncFunction(), 1000, TimeUnit.MILLISECONDS, 100);

Ideally...

[Figure: promises P1 to P4 in the queue; the Emitter outputs elements E1 to E4]

Realistically...

...output ordered based on which request finished first

AsyncFunction

▪ unorderedWait: emit results in order of completion
▪ orderedWait: emit results in order of arrival

▪ Always: watermarks never overtake elements and vice versa
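The difference between the two modes can be illustrated with a small plain-Java simulation. It is a sketch under simplifying assumptions, not Flink's AsyncWaitOperator: EmissionOrder and its emit method are hypothetical, promises are modelled as CompletableFuture instances completed in a given order, and watermarks are not modelled at all.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical simulation of the promise queue; watermarks are not modelled. */
final class EmissionOrder {

    /**
     * Enqueues one promise per element in arrival order, completes them in the
     * given completion order, and returns the order in which results are emitted.
     */
    static List<String> emit(List<String> arrival, List<String> completion, boolean ordered) {
        Map<String, CompletableFuture<String>> promises = new LinkedHashMap<>();
        for (String element : arrival) {
            promises.put(element, new CompletableFuture<>());
        }
        Deque<CompletableFuture<String>> queue = new ArrayDeque<>(promises.values());
        List<String> out = new ArrayList<>();

        for (String done : completion) {
            promises.get(done).complete(done);
            if (ordered) {
                // Ordered: only the head of the queue may be emitted.
                while (!queue.isEmpty() && queue.peekFirst().isDone()) {
                    out.add(queue.pollFirst().join());
                }
            } else {
                // Unordered: any completed promise may be emitted right away.
                Iterator<CompletableFuture<String>> it = queue.iterator();
                while (it.hasNext()) {
                    CompletableFuture<String> promise = it.next();
                    if (promise.isDone()) {
                        out.add(promise.join());
                        it.remove();
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> arrival = Arrays.asList("E1", "E2", "E3", "E4");
        List<String> completion = Arrays.asList("E3", "E1", "E4", "E2");
        System.out.println("ordered:   " + emit(arrival, completion, true));
        System.out.println("unordered: " + emit(arrival, completion, false));
    }
}
```

In the ordered case a finished result waits in the queue until everything that arrived before it has finished, which trades some latency for preserving the input order.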

Documentation

▪ ProcessFunction:
  https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html
  https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/process_function.html

▪ AsyncIO:
  https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

Thank you!

@KLOUBEN_K @ApacheFlink @dataArtisans


We are hiring!

data-artisans.com/careers